Add LRU eviction with 1gb memory limit for PandasData #392
In lumibot/entities/data.py (per the path in the added comment):

@@ -268,6 +268,11 @@ def trim_data(self, df, date_start, date_end, trading_hours_start, trading_hours
        )
        return df
+
+    # ./lumibot/build/__editable__.lumibot-3.1.14-py3-none-any/lumibot/entities/data.py:280:
+    # FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version.
+    # Call result.infer_objects(copy=False) instead.
+    # To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`

    def repair_times_and_fill(self, idx):
        # Trim the global index so that it is within the local data.
        idx = idx[(idx >= self.datetime_start) & (idx <= self.datetime_end)]
@@ -142,7 +142,10 @@ def to_datetime_aware(dt_in):
    """Convert naive time to datetime aware on default timezone."""
    if not dt_in:
        return dt_in
-    elif isinstance(dt_in, dt.datetime) and (dt_in.tzinfo is None or dt_in.tzinfo.utcoffset(dt_in) is None):
+    elif isinstance(dt_in, dt.datetime) and (dt_in.tzinfo is None):
+        return LUMIBOT_DEFAULT_PYTZ.localize(dt_in)
+    elif isinstance(dt_in, dt.datetime) and (dt_in.tzinfo.utcoffset(dt_in) is None):
+        # TODO: This will fail because an exception is thrown if tzinfo is not None.
+        return LUMIBOT_DEFAULT_PYTZ.localize(dt_in)
    else:
        return dt_in

Review comment on lines +147 to +149: The code attempts to localize a datetime object that already has a timezone but no UTC offset. However, as the TODO notes, this will fail, because an exception is thrown when the datetime object's tzinfo attribute is not None. This could lead to unexpected behavior or errors. It would be better to handle this case properly, either by converting the datetime object to a naive datetime before localizing it, or by handling the exception in some way.
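One way the offset-less-tzinfo branch could avoid the exception the TODO warns about is to strip the tzinfo before localizing. A sketch under assumptions: `DEFAULT_TZ` stands in for lumibot's `LUMIBOT_DEFAULT_PYTZ`, and `America/New_York` is an illustrative choice, not necessarily the library's default:

```python
import datetime as dt

import pytz

# Stand-in for LUMIBOT_DEFAULT_PYTZ; the real constant lives in lumibot
DEFAULT_TZ = pytz.timezone("America/New_York")


def to_datetime_aware(dt_in):
    """Convert a naive datetime to an aware one in the default timezone."""
    if not dt_in:
        return dt_in
    if isinstance(dt_in, dt.datetime) and dt_in.tzinfo is None:
        return DEFAULT_TZ.localize(dt_in)
    if isinstance(dt_in, dt.datetime) and dt_in.tzinfo.utcoffset(dt_in) is None:
        # pytz's localize() raises ValueError for a non-naive datetime,
        # so strip the offset-less tzinfo first, then localize.
        return DEFAULT_TZ.localize(dt_in.replace(tzinfo=None))
    return dt_in
```

A fully aware datetime (tzinfo present and a real UTC offset) falls through every branch and is returned unchanged.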
Why is pandas_data_updated no longer being returned? Are you sure this won't break anything?
The only references to update_pandas_data in the library are in this file, and it looked like an internal method to me. The purpose of the method seems to be its side effect of updating PandasData.pandas_data and (sometimes) PandasData._data_store. If update_pandas_data is part of the API, I can certainly put the return values back in.
My question is: why do we have both _data_store and pandas_data in the superclass PandasData? They hold nearly the same data, but there is some kind of staging effect (i.e., out-of-sync behavior) whose entry point is PandasData.load_data. For big backtests that doubling of RAM usage is a significant cost, and I don't know what it is for (although, as I mentioned elsewhere, I haven't read the docs...). If the two copies are needed and the selective updating has a purpose, then I'll document that when adding a docstring to update_data_store.
OK, that makes sense; in that case I think you're right to rearrange it like that.
As for your question about _data_store vs pandas_data: I believe one of them is used for inputting data at the very beginning, while the other is used as the actual data store. I think you're right that we do not need both and would probably be better off removing one of them.
Hmmm, yes. After looking at the code some more, even though the name is update_pandas_data, it seems the actual purpose is to update _data_store.
AFAICT pandas_data is only used by load_store. It could be that the idea is to make get_X calls to populate pandas_data before load_data runs, but that doesn't seem to make a lot of sense, because the only place that updates both is PolygonBacktesting.get_last_price.
How about I rename update_pandas_data to update_data_store and drop the option to update pandas_data? That way, if someone upgrades and this breaks them, they'll know.
Oh, never mind. I had it backwards.
The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on _Strategy.init calling PandasData.load_data to do some fixing up of dates, or something related to options ("expiries"):
- lumibot/lumibot/strategies/_strategy.py, line 215 (at 0599fd7)
- lumibot/lumibot/data_sources/pandas_data.py, line 65 (at 0599fd7)
It seems as though the adjustments done by PandasData.load_data happen only when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? If so, symbols that aren't known before the backtest begins can't be used, or if they are, they won't get whatever adjustments load_data makes.
I've only tested this change on the existing lumibot tests and my backtest, which is just stocks and uses Strategy.get_historical_prices() and get_position().
In order to do no harm, I've made the default value of PolygonDataBacktesting.MAX_STORAGE_BYTES None, since I'm concerned that I don't fully understand what's happening with PolygonDataBacktesting's use of pandas_data, _data_store, and load_data. This change shouldn't have any effect (apart from the rename of update_pandas_data to _update_pandas_data, which I expect isn't used by apps) unless the user enables it by assigning a suitable value to MAX_STORAGE_BYTES.
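The eviction scheme this PR describes, LRU with a byte budget that is disabled when MAX_STORAGE_BYTES is None, can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code; the class name and the caller-supplied per-entry sizes are assumptions (in lumibot the size could come from something like DataFrame.memory_usage().sum()):

```python
from collections import OrderedDict


class LRUDataStore:
    """Hypothetical sketch of LRU eviction under a byte budget.

    Entries report their own size in bytes when inserted.
    """

    def __init__(self, max_storage_bytes=None):
        # None disables eviction, mirroring the PR's MAX_STORAGE_BYTES default
        self.max_storage_bytes = max_storage_bytes
        self._store = OrderedDict()  # insertion/recency-ordered mapping

    def get(self, key):
        value, _nbytes = self._store[key]
        self._store.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value, nbytes):
        self._store[key] = (value, nbytes)
        self._store.move_to_end(key)
        self._evict()

    def _evict(self):
        if self.max_storage_bytes is None:
            return
        # Keep at least one entry so the current request can always be served
        while len(self._store) > 1 and self._total_bytes() > self.max_storage_bytes:
            self._store.popitem(last=False)  # drop the least recently used entry

    def _total_bytes(self):
        return sum(nbytes for _, nbytes in self._store.values())
```

With this shape, passing max_storage_bytes=None preserves today's unbounded behavior, so existing backtests are unaffected unless the limit is explicitly enabled.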
Yes, that sounds correct. If my memory serves, pandas_data is only used on the initial load, whereas the data store is used on an ongoing basis. I sent you a link to my calendar so we can speak (https://calendly.com/lumi-rob/30min); this seems sufficiently complicated to justify a Zoom call.