Add LRU eviction with 1gb memory limit for PandasData #392

Merged: 15 commits merged into Lumiwealth:dev on Mar 12, 2024

Conversation

@jimwhite (Collaborator) commented Mar 10, 2024

Use OrderedDict to implement simple LRU cache of the PandasData.pandas_data and ._data_store in-memory cache dict.

Fixes #391 (OOM when using PolygonBacktesting with many (100+) stocks).

I chose 1gb as the memory limit (configurable, of course) because it works for me.

This PR includes upstream changes from the #387 (Fix repeated bar fetches from Polygon along with some minor nits) and #388 (Change type to asset_type in pandas_data.get_asset_by_symbol) PRs. I realize that makes this look messy, but that's the state of this branch because it's how I'm running and testing this code. Once those other PRs are merged, this will be easier to review.

The relevant changes here are just the ones in the lumibot/data_sources/pandas_data.py and lumibot/backtesting/polygon_backtesting.py files.
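For readers following along, the eviction idea can be sketched roughly as below. This is a minimal illustration, not the merged code: it assumes each cached entry exposes its DataFrame as `.df`, and a real implementation would track a running byte total instead of rescanning on every insert.

```python
from collections import OrderedDict

MAX_STORAGE_BYTES = 1_000_000_000  # illustrative ~1gb budget; configurable in the PR

class LRUDataStore(OrderedDict):
    """Minimal LRU sketch: touch entries on access, evict the oldest when over budget."""

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)  # mark as most recently used
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        self._enforce_limit()

    def _enforce_limit(self):
        # Evict least recently used entries until the estimated footprint fits.
        while len(self) > 1 and self._total_bytes() > MAX_STORAGE_BYTES:
            self.popitem(last=False)

    def _total_bytes(self):
        # Rough estimate: assumes each cached value carries its DataFrame as `.df`.
        return sum(v.df.memory_usage(deep=True).sum() for v in self.values())
```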

@jimwhite jimwhite requested a review from grzesir as a code owner March 10, 2024 18:52
@korbit-ai (bot) left a comment

My review is in progress 📖 - I will have feedback for you in a few minutes!

@korbit-ai (bot) left a comment

I have reviewed your code and found 16 potential issues. To discuss my individual comments that I have added, tag me in replies using @korbit-ai-mentor.


Please react with a 👍 to my comments that you find helpful and a 👎 to those you find unhelpful - this will help me learn and improve as we collaborate.

Comment on lines +271 to +274
# ./lumibot/build/__editable__.lumibot-3.1.14-py3-none-any/lumibot/entities/data.py:280:
# FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version.
# Call result.infer_objects(copy=False) instead.
# To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
Category: Code Design Improvements (priority 7)

The warning comment about the deprecation of downcasting object dtype arrays on .fillna, .ffill, .bfill is important and should be addressed. The code should be updated to use the recommended method to avoid future issues when the behavior changes in a future version of pandas.
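For reference, the replacement pattern the warning itself recommends looks roughly like this (a small sketch, not the PR's code):

```python
import pandas as pd

df = pd.DataFrame({"close": [1.0, None, 3.0]}, dtype=object)

# Old path: df.ffill() silently downcasts object dtypes and raises the FutureWarning.
# Recommended path: forward-fill, then explicitly infer better dtypes.
filled = df.ffill().infer_objects(copy=False)

# Alternatively, opt in to the future behavior globally:
pd.set_option("future.no_silent_downcasting", True)
```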

# Check if the dataframes are the same
if df_all.equals(df_feather):
return
def update_cache(cache_file, df_all, missing_dates=None):
Category: Code Design Improvements (priority 5)

The function 'update_cache' is performing multiple responsibilities, such as handling missing dates and updating the cache file. This violates the Single Responsibility Principle. Consider refactoring the function to separate concerns and improve code maintainability.
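One way to act on that suggestion is to split the persistence step from the missing-date bookkeeping. The helpers below are hypothetical names for illustration, under the assumption that the cache is a feather file keyed by a datetime index:

```python
import pandas as pd

def apply_missing_dates(df_all: pd.DataFrame, missing_dates) -> pd.DataFrame:
    """Hypothetical helper: record known-missing dates so they aren't re-fetched later."""
    if not missing_dates:
        return df_all
    placeholder = pd.DataFrame(index=pd.DatetimeIndex(missing_dates))
    return pd.concat([df_all, placeholder]).sort_index()

def write_cache(cache_file, df_all: pd.DataFrame) -> None:
    """Hypothetical helper: only responsible for persisting the dataframe."""
    df_all.reset_index().to_feather(cache_file)  # feather expects a default index

def update_cache(cache_file, df_all, missing_dates=None):
    # The original function's two responsibilities, now delegated.
    write_cache(cache_file, apply_missing_dates(df_all, missing_dates))
```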

missing_dates = get_missing_dates(df_all, asset, start, end)
update_cache(cache_file, df_all, missing_dates)

# TODO: Do this upstream so we don't have to reload feather repeatedly for known-to-be-missing bars.
Category: Code Design Improvements (priority 4)

The use of 'TODO' comments in the code indicates areas that need improvement or further development. It is best practice to track these tasks outside the codebase, such as in an issue tracker, and to avoid deploying 'TODO' comments to production as they can be easily overlooked.

@@ -1859,7 +1859,7 @@ def get_last_price(self, asset, quote=None, exchange=None, should_use_last_close
"""

# Check if the asset is valid
if asset is None or (type(asset) == Asset and asset.is_valid() is False):
if asset is None or (isinstance(asset, Asset) and not asset.is_valid()):
Category: Critical Errors (priority 7)

The change in the condition to check the asset's validity is good as it uses isinstance() for type checking, which is more Pythonic. However, ensure that the Asset class has an is_valid() method implemented. If the method is not implemented, this change could lead to a runtime error when calling asset.is_valid().

@@ -171,7 +171,7 @@ def get_last_price(self):
float

"""
return self.df["close"][-1]
return self.df["close"].iloc[-1]
Category: Bug Risk (priority 7)

While using .iloc[-1] to access the last element of a pandas DataFrame is a good practice, it's important to ensure that the DataFrame is not empty. If the DataFrame is empty, this will raise an IndexError. It would be good to add a check to ensure the DataFrame is not empty before attempting to access the last element.
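A guarded version of that accessor could look like the following sketch (names are illustrative, not the merged change):

```python
import pandas as pd

def last_close(df: pd.DataFrame):
    """Return the last close, or None when the frame is empty, avoiding the IndexError."""
    if df.empty:
        return None
    return df["close"].iloc[-1]
```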

Comment on lines +147 to 149
elif isinstance(dt_in, dt.datetime) and (dt_in.tzinfo.utcoffset(dt_in) is None):
# TODO: This will fail because an exception is thrown if tzinfo is not None.
return LUMIBOT_DEFAULT_PYTZ.localize(dt_in)
Category: Bug Risk (priority 7)

The code attempts to localize a datetime object that already has a timezone but no UTC offset. However, the comment indicates that this will fail because an exception is thrown if the datetime object's tzinfo attribute is not None. This could lead to unexpected behavior or errors. It would be better to handle this case properly, either by converting the datetime object to a naive datetime before localizing it, or by handling the exception in some way.
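A sketch of the suggested handling, assuming the library's default timezone behaves like a pytz timezone (America/New_York here is only a stand-in for LUMIBOT_DEFAULT_PYTZ):

```python
import datetime as dt
import pytz

LUMIBOT_DEFAULT_PYTZ = pytz.timezone("America/New_York")  # stand-in for the library default

def to_default_tz(dt_in: dt.datetime) -> dt.datetime:
    """Localize naive (or degenerate-tzinfo) datetimes; convert aware ones instead."""
    if dt_in.tzinfo is None or dt_in.tzinfo.utcoffset(dt_in) is None:
        # pytz .localize() only accepts naive datetimes, so drop the unusable tzinfo first.
        return LUMIBOT_DEFAULT_PYTZ.localize(dt_in.replace(tzinfo=None))
    return dt_in.astimezone(LUMIBOT_DEFAULT_PYTZ)
```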


# Polygon is only called once for the same date range even when they are all missing.
mock_polyclient().get_aggs.return_value = []
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.

if df is None:
df = pd.DataFrame()
assert len(df) == 0
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.

],
]
mock_polyclient().get_aggs.side_effect = aggs_result_list + aggs_result_list if force_cache_update else aggs_result_list
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.

assert mock_polyclient().get_aggs.call_count == 3
assert expected_cachefile.exists()
assert len(df) == 7
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test cases added for caching behavior do not cover scenarios where the cache file might be partially updated with new data. It's important to test the merging of new data with existing cached data to ensure the cache remains accurate and up-to-date.


# Polygon is only called once for the same date range even when they are all missing.
mock_polyclient().get_aggs.return_value = []
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

if df is None:
df = pd.DataFrame()
assert len(df) == 0
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

],
]
mock_polyclient().get_aggs.side_effect = aggs_result_list + aggs_result_list if force_cache_update else aggs_result_list
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

assert mock_polyclient().get_aggs.call_count == 3
assert expected_cachefile.exists()
assert len(df) == 7
df = ph.get_price_data_from_polygon(api_key, asset, start_date, end_date, timespan, force_cache_update=force_cache_update)
Category: Bug Risk (priority 7)

The test case 'test_polygon_missing_day_caching' does not verify the content of the DataFrame returned by get_price_data_from_polygon when force_cache_update is True. It's important to assert the expected behavior when the cache is forcibly updated to ensure data integrity.

df_concat = pd.concat([df_all, missing_df]).sort_index()
# Let's be careful and check for duplicates to avoid corrupting the feather file.
if df_concat.index.duplicated().any():
logging.warn(f"Duplicate index entries found when trying to update Polygon cache {cache_file}")
Category: Bug Risk (priority 6)

The warning message logged when duplicate index entries are found could be improved by providing more actionable information. Instead of just logging a warning, consider raising an exception or implementing a recovery mechanism to handle this scenario more gracefully.
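One possible recovery mechanism along the lines the reviewer suggests (a sketch, not the code that was merged) is to drop the duplicates before writing rather than only warning:

```python
import logging
import pandas as pd

def merge_for_cache(df_all: pd.DataFrame, missing_df: pd.DataFrame) -> pd.DataFrame:
    """Concatenate new rows into the cached frame, dropping duplicate index entries."""
    df_concat = pd.concat([df_all, missing_df]).sort_index()
    dupes = df_concat.index.duplicated()
    if dupes.any():
        logging.warning("Dropping %d duplicate index entries before updating the cache", dupes.sum())
        df_concat = df_concat[~dupes]  # keep the first occurrence of each timestamp
    return df_concat
```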

@@ -519,7 +524,7 @@ def get_bars(self, dt, length=1, timestep=MIN_TIMESTEP, timeshift=0):
if data is None:
return None

df = pd.DataFrame(data).set_index("datetime")
df = pd.DataFrame(data).assign(datetime=lambda df: pd.to_datetime(df['datetime'])).set_index('datetime')
Category: Performance (priority 5)

The code modification in the get_bars method uses lambda to convert the 'datetime' column to datetime objects. While this works, it's more efficient to use the pd.to_datetime function directly on the column without a lambda for better performance.
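The more direct form the reviewer has in mind is roughly this (illustrative data, same end result):

```python
import pandas as pd

data = [{"datetime": "2024-03-10 09:30:00", "close": 1.0},
        {"datetime": "2024-03-10 09:31:00", "close": 1.1}]

df = pd.DataFrame(data)
# Convert the column directly instead of through a lambda in .assign().
df["datetime"] = pd.to_datetime(df["datetime"])
df = df.set_index("datetime")
```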

Jim White added 3 commits March 10, 2024 12:04
@grzesir (Contributor) commented Mar 10, 2024

@jimwhite Does this improve Backtesting performance? And if so, by how much approximately?

@jimwhite (Collaborator, Author) replied:

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?

Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).

Merge branch 'dev' of https://github.com/fovi-llc/lumibot into pandas_cache
@@ -37,7 +37,7 @@ def __init__(
# RESTClient API for Polygon.io polygon-api-client
self.polygon_client = RESTClient(self._api_key)

def update_pandas_data(self, asset, quote, length, timestep, start_dt=None):
def update_pandas_data(self, asset, quote, length, timestep, start_dt=None, update_data_store=False):
Contributor:

can you please add to the docstring details about "update_data_store" and what it does?

Contributor:

let me know when you add this and i can merge

Contributor:

Comment added. I've also changed the name to _update_pandas_data to reflect its internal intent; the rename will alert me if someone has a problem with that.


return pandas_data_updated
Contributor:

why is pandas_data_updated no longer being returned? are you sure this won't break anything?

Collaborator Author:

The only references to update_pandas_data in the library are in this file and it looked to be an internal method to me. The purpose of the method seems to be about the side effect of updating PandasData.pandas_data and (sometimes) PandasData._data_store. If update_pandas_data is part of the API I can certainly put the return values back in.

My question is why do we have both _data_store and pandas_data in the super class PandasData? They hold nearly the same data but there is some kind of staging effect (i.e. out of sync behavior) whose top-level is PandasData.load_data. For big backtests that doubling of RAM usage is a significant cost and I don't know what it is for (although as I mentioned elsewhere I haven't read the docs...). If the two copies are needed and the selective updating has a purpose then I'll include that when adding a docstring to update_data_store.

Contributor:

Ok that makes sense, in that case I think you're right to re-arrange it like that.

As for your question about _data_store vs pandas_data: I believe one of them is used for inputting data at the very beginning, while the other is used as the actual data store. I think you're right that we do not need both and would probably be better off removing one of them.

Collaborator Author:

Hmmm, yes. After looking at the code some more, even though the name is update_pandas_data it seems the actual purpose is to update _data_store.

AFAICT pandas_data is only used by load_store. It could be the idea is to do get_X calls in order to populate pandas_data before load_data. But that doesn't seem to make a lot of sense because the only place that updates both is PolygonBacktesting.get_last_price.

How about I rename update_pandas_data to update_data_store and drop the option to update pandas_data? That way, if someone tries to upgrade and this would break them, they'll know.

@jimwhite (Collaborator, Author) commented Mar 12, 2024:

Oh, nevermind. I had it backwards.

The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on PandasData.load_data being called by _Strategy.init to do some fixing up of dates or something related to options ("expiries"):

self.broker.data_source.load_data()

def load_data(self):

It seems as though those adjustments done by PandasData.load_data are only done when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? It seems like that means symbols that aren't known before the backtest begins can't be used, or if they are then they won't get whatever those adjustments are that load_data does.

I've only tested this change on the existing lumibot tests and my backtest which is just stocks and uses Strategy.get_historical_prices() and get_position().

Contributor:

In order to do no harm I've made the default value of PolygonDataBacktesting.MAX_STORAGE_BYTES be None. I'm concerned that I don't understand what's happening with the PolygonDataBacktesting use of pandas_data, _data_store, and load_data. This change shouldn't have any effect (apart from the rename of update_pandas_data to _update_pandas_data which I expect isn't used by apps) unless the user enables it by assigning a suitable value to MAX_STORAGE_BYTES.
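Under that design, opting in would look something like the sketch below (assuming the usual lumibot import path; the 1gb value is just an example, since the attribute defaults to None and eviction stays off until the user sets it):

```python
from lumibot.backtesting import PolygonDataBacktesting

# Give the backtesting data source a storage budget to enable LRU eviction.
# Leaving MAX_STORAGE_BYTES as None (the default) keeps the previous unbounded behavior.
PolygonDataBacktesting.MAX_STORAGE_BYTES = 1_000_000_000  # ~1gb
```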

Contributor:

> Oh, nevermind. I had it backwards.
>
> The overrides PolygonBacktesting._pull_source_symbol_bars and PolygonBacktesting.get_historical_prices_between_dates call update_pandas_data and don't update _data_store. So they rely on PandasData.load_data being called by _Strategy.init to do some fixing up of dates or something related to options ("expiries"):
>
> self.broker.data_source.load_data()
>
> def load_data(self):
>
> It seems as though those adjustments done by PandasData.load_data are only done when the backtest begins and won't happen for symbols that aren't preloaded into pandas_data. Is that correct? It seems like that means symbols that aren't known before the backtest begins can't be used, or if they are then they won't get whatever those adjustments are that load_data does.
>
> I've only tested this change on the existing lumibot tests and my backtest which is just stocks and uses Strategy.get_historical_prices() and get_position().

Yes, that sounds correct. If my memory serves right, pandas_data is only used on the initial load, whereas the data store is used on an ongoing basis. I sent you a link to my calendar for us to speak (https://calendly.com/lumi-rob/30min). This seems sufficiently complicated to justify a Zoom call.

@grzesir (Contributor) commented Mar 12, 2024

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?
>
> Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).

That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost

if pandas_data_update is not None:
# Add the keys to the self.pandas_data dictionary
self.pandas_data.update(pandas_data_update)
self.update_pandas_data(asset, quote, 1, timestep)
Contributor:

why is update_data_store=False here? shouldn't we also update the data store here if needed?

Collaborator Author:

I don't know why only pandas_data is updated in three places and _data_store is updated in one.

if pandas_data_update is not None:
# Add the keys to the self.pandas_data dictionary
self.pandas_data.update(pandas_data_update)
self.update_pandas_data(asset, quote, length, timestep, start_dt)
Contributor:

why is update_data_store=False here? shouldn't we also update the data store here if needed?

@jimwhite (Collaborator, Author) commented Mar 12, 2024:

As above. Would be happy to hear from or chat with someone who knows how this is supposed to work.

Contributor:

@jimwhite I'd be happy to jump on a call with you to walk through this. You can pick a time on my calendar here: https://calendly.com/lumi-rob/30min

@jimwhite (Collaborator, Author) commented Mar 12, 2024

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?
>
> Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).
>
> That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost

Whether limiting the _data_store size will speed things up depends on how much RAM the process is using vs. the physical RAM. So if those strategies have Python processes that are taking dozens of GB on a typical 16~32 GB machine, then I expect the answer will be yes. I've not run anything with options yet, so I don't know what the memory usage looks like. By not duplicating the data into pandas_data (which only happens for get_last_price), as discussed above, the usage will be halved, so I could increase the default MAX_STORAGE_BYTES to 2gb for the same footprint.

As my follow-up comment above says, it seems the pandas_data and _data_store thing is related to load_data being called for options strategies. So if options strategies depend on having all of their data loaded at the start of the backtest, then eviction won't help and in fact would hurt, so I'll need to make sure it doesn't happen in that case (if that is how it works).

@grzesir (Contributor) commented Mar 12, 2024

> @jimwhite Does this improve Backtesting performance? And if so, by how much approximately?
>
> Yeah, even before it crashes from OOM there is a nice performance boost. Before this fix it would take about 5 hours to run 6 months of a 48-month backtest. Now the whole 4 years runs in about 6 hours. When this kicks in depends on how much physical RAM your machine has and whether you're referencing enough symbols to matter (i.e. at least hundreds).
>
> That sounds impressive. Do you think this would also be the case with options strategies that work with hundreds of different options symbols? Excited to try out the speed boost
>
> Whether limiting the _data_store size will speed things up depends on how much RAM the process is using vs. the physical RAM. So if those strategies have Python processes that are taking dozens of GB on a typical 16~32 GB machine, then I expect the answer will be yes. I've not run anything with options yet, so I don't know what the memory usage looks like. By not duplicating the data into pandas_data (which only happens for get_last_price), as discussed above, the usage will be halved, so I could increase the default MAX_STORAGE_BYTES to 2gb for the same footprint.
>
> As my follow-up comment above says, it seems the pandas_data and _data_store thing is related to load_data being called for options strategies. So if options strategies depend on having all of their data loaded at the start of the backtest, then eviction won't help and in fact would hurt, so I'll need to make sure it doesn't happen in that case (if that is how it works).

Excited to try this, merging now

@grzesir grzesir merged commit c567200 into Lumiwealth:dev Mar 12, 2024
1 check failed
Successfully merging this pull request may close: OOM when using PolygonBacktesting with many (100+) stocks (#391).