Add LRU eviction with 1gb memory limit for PandasData #392

Merged: 15 commits merged into Lumiwealth:dev on Mar 12, 2024

Conversation

@jimwhite (Collaborator) commented Mar 10, 2024

Use OrderedDict to implement simple LRU cache of the PandasData.pandas_data and ._data_store in-memory cache dict.

Fixes #391 (OOM when using PolygonBacktesting with many (100+) stocks).

I chose 1gb as the memory limit (configurable, of course) because it works for me.

This PR includes upstream changes from the #387 (Fix repeated bar fetches from Polygon along with some minor nits) and #388 (Change type to asset_type in pandas_data.get_asset_by_symbol) PRs. I realize that makes this look messy, but that's the state of this branch because it's how I'm running and testing this code. Once those other PRs are merged, this will be easier to review.

The relevant changes here are just the ones in the lumibot/data_sources/pandas_data.py and lumibot/backtesting/polygon_backtesting.py files.
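For readers following along, the eviction idea can be sketched roughly as below. This is a minimal illustration, not the merged code: it assumes each cached entry exposes its DataFrame as `.df`, and a real implementation would track a running byte total instead of rescanning on every insert.

```python
from collections import OrderedDict

MAX_STORAGE_BYTES = 1_000_000_000  # illustrative ~1gb budget; configurable in the PR

class LRUDataStore(OrderedDict):
    """Minimal LRU sketch: touch entries on access, evict the oldest when over budget."""

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # mark as most recently used
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        self._enforce_limit()

    def _enforce_limit(self):
        # Evict least recently used entries until the estimated footprint fits.
        while len(self) > 1 and self._total_bytes() > MAX_STORAGE_BYTES:
            self.popitem(last=False)

    def _total_bytes(self):
        # Rough estimate: assumes each cached value carries its DataFrame as `.df`.
        return sum(v.df.memory_usage(deep=True).sum() for v in self.values())
```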

@jimwhite jimwhite requested a review from grzesir as a code owner March 10, 2024 18:52
@korbit-ai (bot) left a comment

My review is in progress 📖 - I will have feedback for you in a few minutes!

@korbit-ai (bot) left a comment

I have reviewed your code and found 16 potential issues. To discuss my individual comments that I have added, tag me in replies using @korbit-ai-mentor.


Please react with a 👍 to my comments that you find helpful and a 👎 to those you find unhelpful - this will help me learn and improve as we collaborate.

Comment on lines +271 to +274
# ./lumibot/build/__editable__.lumibot-3.1.14-py3-none-any/lumibot/entities/data.py:280:
# FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version.
# Call result.infer_objects(copy=False) instead.
# To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
Category: Code Design Improvements (priority 7)

The warning comment about the deprecation of downcasting object dtype arrays on .fillna, .ffill, .bfill is important and should be addressed. The code should be updated to use the recommended method to avoid future issues when the behavior changes in a future version of pandas.
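For reference, the replacement pattern the warning itself recommends looks roughly like this (a small sketch, not the PR's code):

```python
import pandas as pd

df = pd.DataFrame({"close": [1.0, None, 3.0]}, dtype=object)

# Old path: df.ffill() silently downcasts object dtypes and raises the FutureWarning.
# Recommended path: forward-fill, then explicitly infer better dtypes.
filled = df.ffill().infer_objects(copy=False)

# Alternatively, opt in to the future behavior globally:
pd.set_option("future.no_silent_downcasting", True)
```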

# Check if the dataframes are the same
if df_all.equals(df_feather):
return
def update_cache(cache_file, df_all, missing_dates=None):
Category: Code Design Improvements (priority 5)

The function 'update_cache' is performing multiple responsibilities, such as handling missing dates and updating the cache file. This violates the Single Responsibility Principle. Consider refactoring the function to separate concerns and improve code maintainability.
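One way to act on that suggestion is to split the persistence step from the missing-date bookkeeping. The helpers below are hypothetical names for illustration, under the assumption that the cache is a feather file keyed by a datetime index:

```python
import pandas as pd

def apply_missing_dates(df_all: pd.DataFrame, missing_dates) -> pd.DataFrame:
    """Hypothetical helper: record known-missing dates so they aren't re-fetched later."""
    if not missing_dates:
        return df_all
    placeholder = pd.DataFrame(index=pd.DatetimeIndex(missing_dates))
    return pd.concat([df_all, placeholder]).sort_index()

def write_cache(cache_file, df_all: pd.DataFrame) -> None:
    """Hypothetical helper: only responsible for persisting the dataframe."""
    df_all.reset_index().to_feather(cache_file)  # feather expects a default index

def update_cache(cache_file, df_all, missing_dates=None):
    # The original function's two responsibilities, now delegated.
    write_cache(cache_file, apply_missing_dates(df_all, missing_dates))
```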

missing_dates = get_missing_dates(df_all, asset, start, end)
update_cache(cache_file, df_all, missing_dates)

# TODO: Do this upstream so we don't have to reload feather repeatedly for known-to-be-missing bars.
Category: Code Design Improvements (priority 4)

The use of 'TODO' comments in the code indicates areas that need improvement or further development. It is best practice to track these tasks outside the codebase, such as in an issue tracker, and to avoid deploying 'TODO' comments to production as they can be easily overlooked.

@@ -1859,7 +1859,7 @@ def get_last_price(self, asset, quote=None, exchange=None, should_use_last_close
"""

# Check if the asset is valid
if asset is None or (type(asset) == Asset and asset.is_valid() is False):
if asset is None or (isinstance(asset, Asset) and not asset.is_valid()):
Category: Critical Errors (priority 7)

The change in the condition to check the asset's validity is good as it uses isinstance() for type checking, which is more Pythonic. However, ensure that the Asset class has an is_valid() method implemented. If the method is not implemented, this change could lead to a runtime error when calling asset.is_valid().

@@ -171,7 +171,7 @@ def get_last_price(self):
float

"""
return self.df["close"][-1]
return self.df["close"].iloc[-1]
Category: Bug Risk (priority 7)

While using .iloc[-1] to access the last element of a pandas DataFrame is a good practice, it's important to ensure that the DataFrame is not empty. If the DataFrame is empty, this will raise an IndexError. It would be good to add a check to ensure the DataFrame is not empty before attempting to access the last element.
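A guarded version of that accessor could look like the following sketch (names are illustrative, not the merged change):

```python
import pandas as pd

def last_close(df: pd.DataFrame):
    """Return the last close, or None when the frame is empty, avoiding the IndexError."""
    if df.empty:
        return None
    return df["close"].iloc[-1]
```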

Comment on lines +147 to 149
elif isinstance(dt_in, dt.datetime) and (dt_in.tzinfo.utcoffset(dt_in) is None):
# TODO: This will fail because an exception is thrown if tzinfo is not None.
return LUMIBOT_DEFAULT_PYTZ.localize(dt_in)
Category: Bug Risk (priority 7)

The code attempts to localize a datetime object that already has a timezone but no UTC offset. However, the comment indicates that this will fail because an exception is thrown if the datetime object's tzinfo attribute is not None. This could lead to unexpected behavior or errors. It would be better to handle this case properly, either by converting the datetime object to a naive datetime before localizing it, or by handling the exception in some way.
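A sketch of the suggested handling, assuming the library's default timezone behaves like a pytz timezone (America/New_York here is only a stand-in for LUMIBOT_DEFAULT_PYTZ):

```python
import datetime as dt
import pytz

LUMIBOT_DEFAULT_PYTZ = pytz.timezone("America/New_York")  # stand-in for the library default

def to_default_tz(dt_in: dt.datetime) -> dt.datetime:
    """Localize naive (or degenerate-tzinfo) datetimes; convert aware ones instead."""
    if dt_in.tzinfo is None or dt_in.tzinfo.utcoffset(dt_in) is None:
        # pytz .localize() only accepts naive datetimes, so drop the unusable tzinfo first.
        return LUMIBOT_DEFAULT_PYTZ.localize(dt_in.replace(tzinfo=None))
    return dt_in.astimezone(LUMIBOT_DEFAULT_PYTZ)
```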


# Polygon is only called once for the same date range even when they are all missing.
mock_polyclient().get_aggs.return_value = []
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.

if df is None:
df = pd.DataFrame()
assert len(df) == 0
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.

],
]
mock_polyclient().get_aggs.side_effect = aggs_result_list + aggs_result_list if force_cache_update else aggs_result_list
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.

assert mock_polyclient().get_aggs.call_count == 3
assert expected_cachefile.exists()
assert len(df) == 7
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.


# Polygon is only called once for the same date range even when they are all missing.
mock_polyclient().get_aggs.return_value = []
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

if df is None:
df = pd.DataFrame()
assert len(df) == 0
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

],
]
mock_polyclient().get_aggs.side_effect = aggs_result_list + aggs_result_list if force_cache_update else aggs_result_list
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

assert mock_polyclient().get_aggs.call_count == 3
assert expected_cachefile.exists()
assert len(df) == 7
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

df_concat = pd.concat([df_all, missing_df]).sort_index()
# Let's be careful and check for duplicates to avoid corrupting the feather file.
if df_concat.index.duplicated().any():
logging.warn(f"Duplicate index entries found when trying to update Polygon cache {cache_file}")
Category: Bug Risk (priority 6)

The warning message logged when duplicate index entries are found could be improved by providing more actionable information. Instead of just logging a warning, consider raising an exception or implementing a recovery mechanism to handle this scenario more gracefully.
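One possible recovery mechanism along the lines the reviewer suggests (a sketch, not the code that was merged) is to drop the duplicates before writing rather than only warning:

```python
import logging
import pandas as pd

def merge_for_cache(df_all: pd.DataFrame, missing_df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate new rows into the cached frame, dropping duplicate index entries."""
    df_concat = pd.concat([df_all, missing_df]).sort_index()
    dupes = df_concat.index.duplicated()
    if dupes.any():
        logging.warning("Dropping %d duplicate index entries before updating the cache", dupes.sum())
        df_concat = df_concat[~dupes]  # keep the first occurrence of each timestamp
    return df_concat
```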

@@ -519,7 +524,7 @@ def get_bars(self, dt, length=1, timestep=MIN_TIMESTEP, timeshift=0):
if data is None:
return None

df = pd.DataFrame(data).set_index("datetime")
df = pd.DataFrame(data).assign(datetime=lambda df: pd.to_datetime(df['datetime'])).set_index('datetime')
Category: Performance (priority 5)

The code modification in the get_bars method uses lambda to convert the 'datetime' column to datetime objects. While this works, it's more efficient to use the pd.to_datetime function directly on the column without a lambda for better performance.
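The more direct form the reviewer has in mind is roughly this (illustrative data, same end result):

```python
import pandas as pd

data = [{"datetime": "2024-03-10 09:30:00", "close": 1.0},
        {"datetime": "2024-03-10 09:31:00", "close": 1.1}]

df = pd.DataFrame(data)
# Convert the column directly instead of through a lambda in .assign().
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index("datetime")
```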

Jim White added 3 commits March 10, 2024 12:04
@grzesir (Contributor) commented Mar 10, 2024

@jimwhite Does this improve Backtesting performance? And if so, by how much approximately?

@jimwhite (Collaborator, Author) replied:

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?

Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).

Merge branch 'dev' of https://github.com/fovi-llc/lumibot into pandas_cache
@@ -37,7 +37,7 @@ def __init__(
# RESTClient API for Polygon.io polygon-api-client
self.polygon_client = RESTClient(self._api_key)

def update_pandas_data(self, asset, quote, length, timestep, start_dt=None):
def update_pandas_data(self, asset, quote, length, timestep, start_dt=None, update_data_store=False):
Contributor:

can you please add to the docstring details about "update_data_store" and what it does?

Contributor:

let me know when you add this and i can merge

Contributor:

Comment added. I've also changed the name to _update_pandas_data to reflect its internal intent; the rename will alert me if someone has a problem with that.


return pandas_data_updated
Contributor:

why is pandas_data_updated no longer being returned? are you sure this won't break anything?

Collaborator Author:

The only references to update_pandas_data in the library are in this file and it looked to be an internal method to me. The purpose of the method seems to be about the side effect of updating PandasData.pandas_data and (sometimes) PandasData._data_store. If update_pandas_data is part of the API I can certainly put the return values back in.

My question is why do we have both _data_store and pandas_data in the super class PandasData? They hold nearly the same data but there is some kind of staging effect (i.e. out of sync behavior) whose top-level is PandasData.load_data. For big backtests that doubling of RAM usage is a significant cost and I don't know what it is for (although as I mentioned elsewhere I haven't read the docs...). If the two copies are needed and the selective updating has a purpose then I'll include that when adding a docstring to update_data_store.

Contributor:

Ok that makes sense, in that case I think you're right to re-arrange it like that.

As for your question about _data_store vs pandas_data: I believe one of them is used for inputting data at the very beginning, while the other is used as the actual data store. I think you're right that we do not need both and would probably be better off removing one of them.

Collaborator Author:

Hmmm, yes. After looking at the code some more, even though the name is update_pandas_data it seems the actual purpose is to update _data_store.

AFAICT pandas_data is only used by load_store. It could be the idea is to do get_X calls in order to populate pandas_data before load_data. But that doesn't seem to make a lot of sense because the only place that updates both is PolygonBacktesting.get_last_price.

How about I rename update_pandas_data to update_data_store and drop the option to update pandas_data? That way, if someone tries to upgrade and this would break them, they'll know.

@jimwhite (Collaborator, Author) commented Mar 12, 2024:

Oh, nevermind. I had it backwards.

The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on PandasData.load_data being called by _Strategy.init to do some fixing up of dates or something related to options ("expiries"):

self.broker.data_source.load_data()

def load_data(self):

It seems as though those adjustments done by PandasData.load_data are only done when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? It seems like that means symbols that aren't known before the backtest begins can't be used, or if they are then they won't get whatever those adjustments are that load_data does.

I've only tested this change on the existing lumibot tests and my backtest which is just stocks and uses Strategy.get_historical_prices() and get_position().

Contributor:

In order to do no harm I've made the default value of PolygonDataBacktesting.MAX_STORAGE_BYTES be None. I'm concerned that I don't understand what's happening with the PolygonDataBacktesting use of pandas_data, _data_store, and load_data. This change shouldn't have any effect (apart from the rename of update_pandas_data to _update_pandas_data which I expect isn't used by apps) unless the user enables it by assigning a suitable value to MAX_STORAGE_BYTES.
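Under that design, opting in would look something like the sketch below (assuming the usual lumibot import path; the 1gb value is just an example, since the attribute defaults to None and eviction stays off until the user sets it):

```python
from lumibot.backtesting import PolygonDataBacktesting

# Give the backtesting data source a storage budget to enable LRU eviction.
# Leaving MAX_STORAGE_BYTES as None (the default) keeps the previous unbounded behavior.
PolygonDataBacktesting.MAX_STORAGE_BYTES = 1_000_000_000  # ~1gb
```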

Contributor:

> Oh, nevermind. I had it backwards.
>
> The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on PandasData.load_data being called by _Strategy.init to do some fixing up of dates or something related to options ("expiries"):
>
> self.broker.data_source.load_data()
>
> def load_data(self):
>
> It seems as though those adjustments done by PandasData.load_data are only done when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? It seems like that means symbols that aren't known before the backtest begins can't be used, or if they are then they won't get whatever those adjustments are that load_data does.
>
> I've only tested this change on the existing lumibot tests and my backtest which is just stocks and uses Strategy.get_historical_prices() and get_position().

Yes, that sounds correct. If my memory serves right, pandas_data is only used on the initial load, whereas the data store is used on an ongoing basis. I sent you a link to my calendar for us to speak (https://calendly.com/lumi-rob/30min). This seems sufficiently complicated to justify a Zoom call.

@grzesir (Contributor) commented Mar 12, 2024

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?
>
> Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).

That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost

if pandas_data_update is not None:
# Add the keys to the self.pandas_data dictionary
self.pandas_data.update(pandas_data_update)
self.update_pandas_data(asset, quote, 1, timestep)
Contributor:

why is update_data_store=False here? shouldn't we also update the data store here if needed?

Collaborator Author:

I don't know why only pandas_data is updated in three places and _data_store is updated in one.

if pandas_data_update is not None:
# Add the keys to the self.pandas_data dictionary
self.pandas_data.update(pandas_data_update)
self.update_pandas_data(asset, quote, length, timestep, start_dt)
Contributor:

why is update_data_store=False here? shouldn't we also update the data store here if needed?

@jimwhite (Collaborator, Author) commented Mar 12, 2024:

As above. Would be happy to hear from or chat with someone who knows how this is supposed to work.

Contributor:

@jimwhite I'd be happy to jump on a call with you to walk through this. You can pick a time on my calendar here: https://calendly.com/lumi-rob/30min

@jimwhite (Collaborator, Author) commented Mar 12, 2024

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?
>
> Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).
>
> That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost

Whether limiting the _data_store size will speed things up depends on how much RAM the process is using vs. the physical RAM. So if those strategies have Python processes that are taking dozens of GB on a typical 16~32 GB machine, then I expect the answer will be yes. I've not run anything with options yet, so I don't know what the memory usage looks like. By not duplicating the data into pandas_data (which only happens for get_last_price), as discussed above, the usage will be halved, so I could increase the default MAX_STORAGE_BYTES to 2gb for the same footprint.

As my follow-up comment above says, it seems the pandas_data and _data_store thing is related to load_data being called for options strategies. So if options strategies depend on having all of their data loaded at the start of the backtest, then eviction won't help and in fact would hurt, so I'll need to make sure it doesn't happen in that case (if that is how it works).

@grzesir (Contributor) commented Mar 12, 2024

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?
>
> Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).
>
> That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost
>
> Whether limiting the _data_store size will speed things up depends on how much RAM the process is using vs. the physical RAM. So if those strategies have Python processes that are taking dozens of GB on a typical 16~32 GB machine, then I expect the answer will be yes. I've not run anything with options yet, so I don't know what the memory usage looks like. By not duplicating the data into pandas_data (which only happens for get_last_price), as discussed above, the usage will be halved, so I could increase the default MAX_STORAGE_BYTES to 2gb for the same footprint.
>
> As my follow-up comment above says, it seems the pandas_data and _data_store thing is related to load_data being called for options strategies. So if options strategies depend on having all of their data loaded at the start of the backtest, then eviction won't help and in fact would hurt, so I'll need to make sure it doesn't happen in that case (if that is how it works).

Excited to try this, merging now

@grzesir grzesir merged commit c567200 into Lumiwealth:dev Mar 12, 2024
1 check failed
Successfully merging this pull request may close: OOM when using PolygonBacktesting with many (100+) stocks (#391).