
Memoization to S3 #28

Merged: 9 commits merged into master from memoize_s3 on Sep 20, 2023
Conversation

@atspaeth (Member) commented Sep 16, 2023

This PR adds a decorator which uses joblib.Memory to cache the results of
arbitrary functions to S3. I also updated the NumpyS3Memmap unit test while
I was writing mine.
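A minimal usage sketch (the function and cache location are hypothetical; the call style follows the examples discussed later in this thread):

```python
from braingeneers.utils.memoize_s3 import memoize

# Hypothetical example: the first call computes the result and caches it
# under the given S3 prefix; later calls with the same arguments load
# the stored result from S3 instead of recomputing it.
@memoize("s3://braingeneersdev/example-user/cache")
def simulate(n_neurons, connectivity):
    ...
```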

@davidparks21 this adds joblib as an optional dependency, which won't be
imported unless you try to import braingeneers.utils.memoize_s3. What's
the best way to declare this in the package metadata? (For now I've just
added it as a hard requirement in the "data" dependency group.)

@atspaeth added the enhancement label on Sep 16, 2023
Previously, using the keyword arguments of `Memory.cache()`, in
particular `ignore`, required invoking the decorator twice, which was
fairly annoying. Now those keyword arguments are accepted directly and
passed on to `Memory.cache()` instead of the `Memory` constructor.
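A hedged sketch of the difference (the exact old call form is assumed from the description above):

```python
from braingeneers.utils.memoize_s3 import memoize

# Before this change, passing Memory.cache() kwargs meant a second
# invocation, roughly:
#
#     @memoize("s3://bucket/cache")(ignore=["verbose"])
#     def f(x, verbose=False): ...
#
# Now they can be given directly and are forwarded to Memory.cache():
@memoize("s3://bucket/cache", ignore=["verbose"])
def f(x, verbose=False):
    ...
```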
`joblib` is required for `memoize_s3`, so it should be an optional
dependency at some level. Most likely it only makes sense to import it
together with the "data" dependencies, so this commit puts it there...
This commit fixes three issues:
1. Previously I had implemented a method that is only used for cache
   eviction, which would have made the cache LIFO instead of LRU. By
   removing this, cache eviction simply doesn't happen, which is less
   confusing. This is also mentioned in the documentation.
2. Some keyword arguments weren't accepted in some forms of the
   @memoize decorator invocation.
3. A version constraint on joblib was missing: one of the keyword
   arguments supported here doesn't exist before `1.3`.
@davidparks21 (Contributor) commented

That's an interesting question. Cordero is switching us to the newer dependency/installer format recommended by PyPI (the way we currently do it is now deprecated).

@uwcdc are we changing the way that optional dependencies are handled in the new installation method?

The issue Alex is facing here is that he has a dependency on something that logically belongs in the utils package, and the utils package is included in the minimal install (for Raspberry Pis, for example). I don't understand the new installer well enough to say how he should best handle the optional dependency.

@DailyDreaming (Member) left a comment

Hello Alex, I thought I'd chip in. I have one concern with how the directories are set (and can be recursively deleted).

I also wanted to ask how this is going to be used in practice (for example, which functions use it currently)?

braingeneers/utils/configure.py (outdated, resolved)
)(location)

if location is None and backend == "s3":
    location = f"s3://braingeneersdev/{os.environ['S3_USER']}/cache"
@DailyDreaming (Member)

Is "S3_USER" always guaranteed to be set?

@DailyDreaming (Member)

Also, it might be bad if user names are in the (currently relatively uncluttered) root directory. So I'd like to suggest the default location be:

location = f"s3://braingeneersdev/cache/{os.environ['S3_USER']}"

In fact, I would also enforce that here, and raise an exception if a user selects any directory without the prefix "s3://braingeneersdev/cache/". It could be a problem if a user unknowingly selects "s3://braingeneersdev/" or "s3://braingeneersdev/ephys" as their cache dir and then decides to clean it.
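A minimal sketch of the suggested check (the function name is hypothetical):

```python
REQUIRED_PREFIX = "s3://braingeneersdev/cache/"

def check_cache_location(location):
    # Hypothetical guard: refuse any cache dir outside the dedicated
    # cache prefix, so a recursive cache clear can never touch shared
    # data such as s3://braingeneersdev/ephys.
    if not location.startswith(REQUIRED_PREFIX):
        raise ValueError(
            f"Cache location must start with {REQUIRED_PREFIX!r}, got {location!r}"
        )
```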

@atspaeth (Member, Author)

The reason I chose this location is actually that braingeneersdev already has a bunch of username directories in it. If they're not supposed to be there, I'm happy to change this to whatever default you guys think is reasonable.

[screenshot: username directories at the top level of s3://braingeneersdev]

@atspaeth (Member, Author)

As for $S3_USER, unfortunately it's definitely not guaranteed to be set. The idea was for each user's calls to these memoized functions to be saved under their own user prefix, so it's supposed to fail if the user hasn't set it, since there's no general way to discover that prefix.

@DailyDreaming (Member)

Ah, sorry about the usernames. I was looking in s3://braingeneers and not s3://braingeneersdev, I now realize.

@DailyDreaming (Member)

> As for $S3_USER, unfortunately it's definitely not guaranteed to be set. The idea was for each user's calls to these memoized functions to be saved under their own user prefix, so it's supposed to fail if the user hasn't set it, since there's no general way to discover that prefix.

If that's the case, I would recommend something like:

S3_USER = os.environ.get('S3_USER', 'common')


def clear_location(self, location):
    # Recursive delete.
    wr.s3.delete_objects(glob.escape(location))
@DailyDreaming (Member)

I make this suggestion in a following comment as well, but I would enforce a cache-dir prefix here and raise an error when trying to delete outside of the cache dir; the suggested prefix is s3://braingeneersdev/cache/. A user should not be able to accidentally clear non-cache data if they naively set, for example, s3://braingeneersdev/ephys/ as their cache dir.

@atspaeth (Member, Author)

That's a good point, but it's also a lot less dangerous than it looks because of the way joblib uses the cache dir you give it. A call like this:

@memoize("s3://braingeneers/")
def bar(baz): ...

in the module foo.py has its actual cache files stored under s3://braingeneers/joblib/foo/bar/, so only things under that whole prefix get deleted when the cache is cleared.

@DailyDreaming (Member)

Perfect. In that case, I would still check for the s3://braingeneers/joblib prefix. It's good to be a little paranoid about recursive deletion, I think.

@atspaeth (Member, Author) commented Sep 19, 2023

Yeah, fair. I'll check that the requested location actually starts with self.location + '/joblib/', and then nothing weird should happen.

(Edit: self.location already includes the /joblib/ part, so the commit implementing this is right and what I originally said in this comment is wrong.)

@atspaeth (Member, Author)

I'm planning to use this to run computations in containers and then analyze the results on my PC. I've been using it for spiking neuronal simulations so far (the function parameters are things like the number and type of neurons, connectivity, etc., and the result is a fairly large SpikeData object containing all the neural activity), and today I'm switching to also using it for some HMM fits.

@davidparks21 pointed out that utils are supposed to be usable even with
only the "minimal" dependencies, and OKed moving `joblib`, `awswrangler`,
and `smart_open` all to "minimal". This also allows removing
`awswrangler` and `smart_open` from the "analysis" dependencies, so
their versions only have to be specified in one place.

@DailyDreaming suggested putting an upper limit on the joblib dependency
as well in case there are breaking changes in the next major version.
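A hedged sketch of how the combined constraint might look in the package metadata (the exact file and field are assumptions):

```python
# Hypothetical setup.py-style excerpt: the floor comes from the earlier
# fix (a keyword argument used here was added in joblib 1.3), and the
# ceiling guards against breaking changes in the next major version.
install_requires = ["joblib>=1.3,<2"]
```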
@DailyDreaming (Member)

> I'm planning to use this to run computations in containers and then analyze the results on my PC. I've been using it for spiking neuronal simulations so far (the function parameters are things like the number and type of neurons, connectivity, etc., and the result is a fairly large SpikeData object containing all the neural activity), and today I'm switching to also using it for some HMM fits.

Thanks for explaining. This might be a source of large amounts of data in need of clean up in the future, so I wanted to understand how cosmopolitan the usage might be.

@atspaeth (Member, Author) commented Sep 19, 2023

> This might be a source of large amounts of data in need of clean up in the future

Yeah, I was worried about this too. joblib tries to support an LRU eviction policy where you can limit the size of a cache, but S3 doesn't seem to support access time natively, so I think I'd have to do something weird involving adding custom metadata attributes to every file created with this method and updating them on load. I'm 95% sure it's possible but didn't want to have to deal with it yet :)
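For the record, one way that idea could look (an illustrative sketch only; the bucket/key handling and metadata name are assumptions):

```python
import datetime
import boto3

s3 = boto3.client("s3")

def touch(bucket, key):
    # S3 has no native access time, so keep a custom "last-accessed"
    # timestamp in the object's metadata, and refresh it on every cache
    # read by copying the object onto itself with replaced metadata.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"last-accessed": datetime.datetime.utcnow().isoformat()},
        MetadataDirective="REPLACE",
    )
```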

@DailyDreaming
Copy link
Member

> Yeah, I was worried about this too. joblib tries to support an LRU eviction policy where you can limit the size of a cache, but S3 doesn't seem to support access time natively, so I think I'd have to do something weird involving adding custom metadata attributes to every file created with this method and updating them on load. I'm 95% sure it's possible but didn't want to have to deal with it yet :)

Completely understandable. It shouldn't be an issue for a while in any case. :)

Add a guard in `S3StoreBackend.clear_location()` checking that the
location to delete starts with `self.location`, so it's less likely to
accidentally delete some other directory.
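A sketch of what that guard could look like (reusing the awswrangler-based delete from the reviewed snippet above; the surrounding class is abridged):

```python
import glob
import awswrangler as wr

class S3StoreBackend:
    location: str  # cache prefix, already including the /joblib/ part

    def clear_location(self, location):
        # Guard: refuse to delete anything outside this backend's own
        # cache prefix before performing the recursive delete.
        if not location.startswith(self.location):
            raise ValueError(
                f"{location!r} is outside the cache at {self.location!r}"
            )
        wr.s3.delete_objects(glob.escape(location))
```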
@atspaeth merged commit 80761b7 into master on Sep 20, 2023
@atspaeth deleted the memoize_s3 branch on Sep 20, 2023