Add LRU eviction with 1gb memory limit for PandasData #392
Conversation
…age_limit method.
… to help future sleuthing.
…_data_from_polygon will return them).
My review is in progress 📖 - I will have feedback for you in a few minutes!
I have reviewed your code and found 16 potential issues. To discuss my individual comments that I have added, tag me in replies using @korbit-ai-mentor.
Please react with a 👍 to my comments that you find helpful and a 👎 to those you find unhelpful - this will help me learn and improve as we collaborate.
# ./lumibot/build/__editable__.lumibot-3.1.14-py3-none-any/lumibot/entities/data.py:280:
# FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version.
# Call result.infer_objects(copy=False) instead.
# To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
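For context, a minimal sketch of the migration pandas recommends in that warning (the DataFrame here is illustrative, not the actual lumibot data):

```python
import pandas as pd

# An object-dtype column that ffill would silently downcast to bool.
df = pd.DataFrame({"flag": [True, None, False]}, dtype=object)

# Recommended fix from the warning text: call infer_objects(copy=False)
# explicitly so the downcast is opt-in rather than silent.
filled = df.ffill().infer_objects(copy=False)
print(filled["flag"].tolist())  # [True, True, False]
```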
lumibot/tools/polygon_helper.py
Outdated
# Check if the dataframes are the same
if df_all.equals(df_feather):
    return
def update_cache(cache_file, df_all, missing_dates=None):
lumibot/tools/polygon_helper.py
Outdated
missing_dates = get_missing_dates(df_all, asset, start, end)
update_cache(cache_file, df_all, missing_dates)

# TODO: Do this upstream so we don't have to reload feather repeatedly for known-to-be-missing bars.
lumibot/strategies/strategy.py
Outdated
@@ -1859,7 +1859,7 @@ def get_last_price(self, asset, quote=None, exchange=None, should_use_last_close
        """

        # Check if the asset is valid
-       if asset is None or (type(asset) == Asset and asset.is_valid() is False):
+       if asset is None or (isinstance(asset, Asset) and not asset.is_valid()):
The change in the condition to check the asset's validity is good as it uses isinstance() for type checking, which is more Pythonic. However, ensure that the Asset class has an is_valid() method implemented. If the method is not implemented, this change could lead to a runtime error when calling asset.is_valid().
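To illustrate the difference: `type(x) == Asset` rejects subclasses, so subclass instances would silently skip the validity check, while `isinstance()` covers them. A small self-contained sketch (the `Asset`/`OptionAsset` classes here are hypothetical stand-ins, not lumibot's actual implementation):

```python
class Asset:
    def __init__(self, symbol):
        self.symbol = symbol

    def is_valid(self):
        return bool(self.symbol)


class OptionAsset(Asset):  # hypothetical subclass for illustration
    pass


asset = OptionAsset("SPY")

# type() comparison rejects subclasses, so the validity check is skipped:
assert not (type(asset) == Asset)

# isinstance() covers subclasses, so OptionAsset instances are checked too:
assert isinstance(asset, Asset)
assert not (asset is None or (isinstance(asset, Asset) and not asset.is_valid()))
```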
lumibot/entities/bars.py
Outdated
@@ -171,7 +171,7 @@ def get_last_price(self):
        float

        """
-       return self.df["close"][-1]
+       return self.df["close"].iloc[-1]
While using .iloc[-1] to access the last element of a pandas DataFrame is a good practice, it's important to ensure that the DataFrame is not empty. If the DataFrame is empty, this will raise an IndexError. It would be good to add a check to ensure the DataFrame is not empty before attempting to access the last element.
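A minimal sketch of such a guard (the `last_close` helper is hypothetical, not part of lumibot):

```python
import pandas as pd


def last_close(df: pd.DataFrame):
    """Return the last close price, or None if the DataFrame is empty."""
    if df.empty:
        return None  # avoid the IndexError that .iloc[-1] raises on an empty frame
    return df["close"].iloc[-1]


prices = pd.DataFrame({"close": [101.0, 102.5]})
assert last_close(prices) == 102.5
assert last_close(pd.DataFrame({"close": []})) is None
```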
elif isinstance(dt_in, dt.datetime) and (dt_in.tzinfo.utcoffset(dt_in) is None):
    # TODO: This will fail because an exception is thrown if tzinfo is not None.
    return LUMIBOT_DEFAULT_PYTZ.localize(dt_in)
The code attempts to localize a datetime object that already has a timezone but no UTC offset. However, the comment indicates that this will fail because an exception is thrown if the datetime object's tzinfo attribute is not None. This could lead to unexpected behavior or errors. It would be better to handle this case properly, either by converting the datetime object to a naive datetime before localizing it, or by handling the exception in some way.
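One way to handle this, sketched under the assumption that `LUMIBOT_DEFAULT_PYTZ` is a pytz timezone (the helper name and the New York stand-in are hypothetical): strip the offset-less `tzinfo` before localizing, since pytz's `localize()` raises `ValueError` on any tz-aware datetime.

```python
import datetime as dt
import pytz

DEFAULT_TZ = pytz.timezone("America/New_York")  # stand-in for LUMIBOT_DEFAULT_PYTZ


def to_default_tz(dt_in: dt.datetime) -> dt.datetime:
    # pytz's localize() raises ValueError when tzinfo is already set,
    # so make the datetime naive first when its tzinfo carries no offset.
    if dt_in.tzinfo is not None and dt_in.tzinfo.utcoffset(dt_in) is None:
        dt_in = dt_in.replace(tzinfo=None)
    if dt_in.tzinfo is None:
        return DEFAULT_TZ.localize(dt_in)
    # Already tz-aware with a real offset: convert instead of localizing.
    return dt_in.astimezone(DEFAULT_TZ)
```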
tests/test_polygon_helper.py
Outdated

# Polygon is only called once for the same date range even when they are all missing.
mock_polyclient().get_aggs.return_value = []
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
tests/test_polygon_helper.py
Outdated
if df is None:
    df = pd.DataFrame()
assert len(df) == 0
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
tests/test_polygon_helper.py
Outdated
    ],
]
mock_polyclient().get_aggs.side_effect = aggs_result_list + aggs_result_list if force_cache_update else aggs_result_list
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
tests/test_polygon_helper.py
Outdated
assert mock_polyclient().get_aggs.call_count == 3
assert expected_cachefile.exists()
assert len(df) == 7
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
lumibot/tools/polygon_helper.py
Outdated
df_concat = pd.concat([df_all, missing_df]).sort_index()
# Let's be careful and check for duplicates to avoid corrupting the feather file.
if df_concat.index.duplicated().any():
    logging.warn(f"Duplicate index entries found when trying to update Polygon cache {cache_file}")
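Beyond just logging, one way to keep the cache writable is to drop the duplicated index entries before persisting. A hedged sketch (the `merge_for_cache` helper is illustrative, not the PR's actual function; note also that `logging.warning` is the non-deprecated spelling of `logging.warn`):

```python
import logging

import pandas as pd


def merge_for_cache(df_all: pd.DataFrame, missing_df: pd.DataFrame) -> pd.DataFrame:
    """Concat, sort, and drop duplicate index entries so the feather
    cache is never written with a corrupted (duplicated) index."""
    df_concat = pd.concat([df_all, missing_df]).sort_index()
    if df_concat.index.duplicated().any():
        logging.warning("Duplicate index entries found while merging cache data")
        df_concat = df_concat[~df_concat.index.duplicated(keep="first")]
    return df_concat


a = pd.DataFrame({"close": [1.0, 2.0]}, index=[1, 2])
b = pd.DataFrame({"close": [2.5, 3.0]}, index=[2, 3])
merged = merge_for_cache(a, b)
assert list(merged.index) == [1, 2, 3]
assert merged.loc[2, "close"] == 2.0  # keep="first" prefers existing cache rows
```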
lumibot/entities/data.py
Outdated
@@ -519,7 +524,7 @@ def get_bars(self, dt, length=1, timestep=MIN_TIMESTEP, timeshift=0):
        if data is None:
            return None

-       df = pd.DataFrame(data).set_index("datetime")
+       df = pd.DataFrame(data).assign(datetime=lambda df: pd.to_datetime(df['datetime'])).set_index('datetime')
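The point of the added `pd.to_datetime` step is that `set_index` alone keeps whatever dtype the column had; converting first yields a proper `DatetimeIndex`, which time-based slicing and comparisons rely on. A small sketch with made-up data:

```python
import pandas as pd

data = {"datetime": ["2024-01-02", "2024-01-03"], "close": [101.0, 102.5]}

# Without the conversion the index is plain strings:
df_str = pd.DataFrame(data).set_index("datetime")
assert df_str.index.dtype == object

# With the conversion the index is a DatetimeIndex:
df_dt = (
    pd.DataFrame(data)
    .assign(datetime=lambda df: pd.to_datetime(df["datetime"]))
    .set_index("datetime")
)
assert isinstance(df_dt.index, pd.DatetimeIndex)
```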
Merge remote-tracking branch 'origin/pandas_cache' into pandas_cache
@jimwhite Does this improve backtesting performance? And if so, by how much, approximately?
Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).
Merge branch 'dev' of https://github.com/fovi-llc/lumibot into pandas_cache
@@ -37,7 +37,7 @@ def __init__(
        # RESTClient API for Polygon.io polygon-api-client
        self.polygon_client = RESTClient(self._api_key)

-   def update_pandas_data(self, asset, quote, length, timestep, start_dt=None):
+   def update_pandas_data(self, asset, quote, length, timestep, start_dt=None, update_data_store=False):
Can you please add details to the docstring about `update_data_store` and what it does?
Let me know when you add this and I can merge.
Comment added. I've also changed the name to _update_pandas_data to reflect its internal intent; the rename will alert me if someone has a problem with that.
return pandas_data_updated
Why is pandas_data_updated no longer being returned? Are you sure this won't break anything?
The only references to update_pandas_data in the library are in this file, and it looked like an internal method to me. The purpose of the method seems to be the side effect of updating PandasData.pandas_data and (sometimes) PandasData._data_store. If update_pandas_data is part of the API I can certainly put the return values back in.
My question is why do we have both _data_store and pandas_data in the super class PandasData? They hold nearly the same data but there is some kind of staging effect (i.e. out of sync behavior) whose top-level is PandasData.load_data. For big backtests that doubling of RAM usage is a significant cost and I don't know what it is for (although as I mentioned elsewhere I haven't read the docs...). If the two copies are needed and the selective updating has a purpose then I'll include that when adding a docstring to update_data_store.
Ok that makes sense, in that case I think you're right to re-arrange it like that.
As for your question about _data_store vs pandas_data: I believe one of them is used for inputting data at the very beginning, while the other serves as the actual data store. I think you're right that we don't need both and would probably be better off removing one of them.
Hmmm, yes. After looking at the code some more, even though the name is update_pandas_data it seems the actual purpose is to update _data_store.
AFAICT pandas_data is only used by load_data. It could be that the idea is to do get_X calls to populate pandas_data before load_data, but that doesn't seem to make a lot of sense because the only place that updates both is PolygonBacktesting.get_last_price.
How about I rename update_pandas_data to update_data_store and drop the option to update pandas_data? That way, if someone upgrades and this would break them, they'll know.
Oh, nevermind. I had it backwards.
The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on _Strategy.__init__ calling PandasData.load_data to do some fixing up of dates or something related to options ("expiries"):
lumibot/lumibot/strategies/_strategy.py (line 215 in 0599fd7):
    self.broker.data_source.load_data()
    def load_data(self):
It seems as though those adjustments done by PandasData.load_data are only done when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? It seems like that means symbols that aren't known before the backtest begins can't be used, or if they are then they won't get whatever those adjustments are that load_data does.
I've only tested this change on the existing lumibot tests and my backtest which is just stocks and uses Strategy.get_historical_prices() and get_position().
In order to do no harm I've made the default value of PolygonDataBacktesting.MAX_STORAGE_BYTES be None. I'm concerned that I don't understand what's happening with the PolygonDataBacktesting use of pandas_data, _data_store, and load_data. This change shouldn't have any effect (apart from the rename of update_pandas_data to _update_pandas_data which I expect isn't used by apps) unless the user enables it by assigning a suitable value to MAX_STORAGE_BYTES.
> Oh, nevermind. I had it backwards.
> The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on _Strategy.__init__ calling PandasData.load_data to do some fixing up of dates or something related to options ("expiries"):
> lumibot/lumibot/strategies/_strategy.py (line 215 in 0599fd7): self.broker.data_source.load_data()
> def load_data(self):
> It seems as though those adjustments done by PandasData.load_data are only done when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? It seems like that means symbols that aren't known before the backtest begins can't be used, or if they are then they won't get whatever those adjustments are that load_data does.
> I've only tested this change on the existing lumibot tests and my backtest which is just stocks and uses Strategy.get_historical_prices() and get_position().
Yes that sounds correct, if my memory serves right, pandas data is only used on the initial load whereas data store is used on an ongoing basis. I sent you a link to my calendar for us to speak (https://calendly.com/lumi-rob/30min). This seems sufficiently complicated to justify a zoom call
That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost.
if pandas_data_update is not None:
    # Add the keys to the self.pandas_data dictionary
    self.pandas_data.update(pandas_data_update)
self.update_pandas_data(asset, quote, 1, timestep)
Why is update_data_store=False here? Shouldn't we also update the data store here if needed?
I don't know why only pandas_data is updated in three places and _data_store is updated in one.
if pandas_data_update is not None:
    # Add the keys to the self.pandas_data dictionary
    self.pandas_data.update(pandas_data_update)
self.update_pandas_data(asset, quote, length, timestep, start_dt)
Why is update_data_store=False here? Shouldn't we also update the data store here if needed?
As above. Would be happy to hear from or chat with someone who knows how this is supposed to work.
@jimwhite I'd be happy to jump on a call with you to walk through this. You can pick a time on my calendar here: https://calendly.com/lumi-rob/30min
Whether limiting the _data_store size will speed things up depends on how much RAM a process is using vs the physical RAM. So if those strategies have Python processes that are taking dozens of GB on a typical 16 to 32 GB machine then I expect the answer will be yes. I've not run anything with options yet so I don't know what the memory usage looks like. By not duplicating the data into pandas_data (which only happens for get_last_price), as discussed above, the usage will be halved, so I could increase the default MAX_STORAGE_BYTES to 2 GB for the same footprint. As my follow-up comment above says, it seems the pandas_data and _data_store thing is related to load_data being called for options strategies. So if options strategies depend on having all of their data loaded at the start of the backtest then eviction won't help and in fact would hurt, so I'll need to make sure it doesn't happen in that case (if that is how it works).
Excited to try this, merging now
Use OrderedDict to implement a simple LRU cache for the PandasData.pandas_data and ._data_store in-memory cache dicts.
Fixes #391 (OOM when using PolygonBacktesting with many (100+) symbols).
I chose 1 GB as the memory limit (configurable, of course) because it WFM.
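The OrderedDict-based LRU idea can be sketched roughly like this. All names here (`LRUDataStore`, `put`, `get`) are illustrative, not lumibot's actual API, and the sizing heuristic uses `sys.getsizeof` where real code would estimate DataFrame memory with `DataFrame.memory_usage()`:

```python
import sys
from collections import OrderedDict

MAX_STORAGE_BYTES = 1_000_000_000  # ~1 GB, configurable as in the PR


class LRUDataStore:
    """Illustrative LRU dict: move entries to the end on access and evict
    from the front when the estimated footprint exceeds the limit."""

    def __init__(self, max_bytes=MAX_STORAGE_BYTES):
        self.max_bytes = max_bytes
        self._store = OrderedDict()

    def _size_of(self, value):
        # Crude estimate; real code would use DataFrame.memory_usage().
        return sys.getsizeof(value)

    def _total_bytes(self):
        return sum(self._size_of(v) for v in self._store.values())

    def get(self, key):
        value = self._store[key]
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        # Evict least-recently-used entries until under the limit.
        while self.max_bytes is not None and self._total_bytes() > self.max_bytes:
            self._store.popitem(last=False)
```

With `max_bytes=None` (the PR's do-no-harm default for `MAX_STORAGE_BYTES`) eviction never runs and the store behaves like a plain dict.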
This PR includes upstream changes from the #387 (Fix repeated bar fetches from Polygon along with some minor nits) and #388 (Change type to asset_type in pandas_data.get_asset_by_symbol) PRs. I realize that makes this look messy, but that's the state of this branch because that's how I'm running and testing this code. When those other PRs are merged, this will be easier to review.
The relevant changes here are just the ones in the lumibot/data_sources/pandas_data.py and lumibot/backtesting/polygon_backtesting.py files.