
Memoization to S3 #28

Merged: 9 commits merged into master from memoize_s3 on Sep 20, 2023
Conversation

@atspaeth (Member) commented Sep 16, 2023

This PR adds a decorator which uses joblib.Memory to cache the results of
arbitrary functions to S3. I also updated the NumpyS3Memmap unit test while
I was writing mine.
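A minimal usage sketch (the function and cache location are hypothetical; the call style follows the examples discussed later in this thread):

```python
from braingeneers.utils.memoize_s3 import memoize

# Hypothetical example: the first call computes the result and caches it
# under the given S3 prefix; later calls with the same arguments load
# the stored result from S3 instead of recomputing it.
@memoize("s3://braingeneersdev/example-user/cache")
def simulate(n_neurons, connectivity):
    ...
```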

@davidparks21 this adds joblib as an optional dependency, which won't be
imported unless you try to import braingeneers.utils.memoize_s3. What's
the best way to declare this in the package metadata? (For now I've just
added it as a hard requirement in the "data" dependency group.)

@atspaeth added the enhancement label on Sep 16, 2023
Previously, using the keyword arguments of `Memory.cache()`, in
particular `ignore`, required invoking the decorator twice, which was
fairly annoying. Now those keyword arguments are accepted directly and
passed on to `Memory.cache()` instead of the `Memory` constructor.
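A hedged sketch of the difference (the exact old call form is assumed from the description above):

```python
from braingeneers.utils.memoize_s3 import memoize

# Before this change, passing Memory.cache() kwargs meant a second
# invocation, roughly:
#
#     @memoize("s3://bucket/cache")(ignore=["verbose"])
#     def f(x, verbose=False): ...
#
# Now they can be given directly and are forwarded to Memory.cache():
@memoize("s3://bucket/cache", ignore=["verbose"])
def f(x, verbose=False):
    ...
```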
`joblib` is required for `memoize_s3`, so it should be an optional
dependency at some level. Most likely it only makes sense to import it
together with the "data" dependencies, so this commit puts it there...
This commit fixes three issues:
1. Previously I had implemented a method that is only used for cache
   eviction, which would have made the cache LIFO instead of LRU. By
   removing this, cache eviction simply doesn't happen, which is less
   confusing. This is also mentioned in the documentation.
2. Some keyword arguments weren't accepted in some forms of the
   @memoize decorator invocation.
3. A version constraint on joblib was missing: one of the keyword
   arguments supported here doesn't exist before `1.3`.
@davidparks21 (Contributor) commented

That's an interesting question. Cordero is switching us to the newer dependency/installer format recommended by PyPI (the way we currently do it is now deprecated).

@uwcdc are we changing the way that optional dependencies are handled in the new installation method?

The issue Alex is facing here is that he has a dependency on something that logically belongs in the utils package, and the utils package is included in the minimal install (for Raspberry Pis, for example). I don't understand the new installer well enough to say how he should best handle the optional dependency.

@DailyDreaming (Member) left a comment

Hello Alex, I thought I'd chip in. I have one concern with how the directories are set (and can be recursively deleted).

I also wanted to ask how this is going to be used in practice (for example, which functions use it currently)?

braingeneers/utils/configure.py (outdated, resolved)
)(location)

if location is None and backend == "s3":
    location = f"s3://braingeneersdev/{os.environ['S3_USER']}/cache"
@DailyDreaming (Member)

Is "S3_USER" always guaranteed to be set?

@DailyDreaming (Member)

Also, it might be bad if user names are in the (currently relatively uncluttered) root directory. So I'd like to suggest the default location be:

location = f"s3://braingeneersdev/cache/{os.environ['S3_USER']}"

In fact, I would also enforce that here, and raise an exception if a user selects any directory without the prefix "s3://braingeneersdev/cache/". It could be a problem if a user unknowingly selects "s3://braingeneersdev/" or "s3://braingeneersdev/ephys" as their cache dir and then decides to clean it.
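A minimal sketch of the suggested check (the function name is hypothetical):

```python
REQUIRED_PREFIX = "s3://braingeneersdev/cache/"

def check_cache_location(location):
    # Hypothetical guard: refuse any cache dir outside the dedicated
    # cache prefix, so a recursive cache clear can never touch shared
    # data such as s3://braingeneersdev/ephys.
    if not location.startswith(REQUIRED_PREFIX):
        raise ValueError(
            f"Cache location must start with {REQUIRED_PREFIX!r}, got {location!r}"
        )
```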

@atspaeth (Member, Author)

The reason I chose this location is actually that braingeneersdev already has a bunch of username directories in it. If they're not supposed to be there, I'm happy to change this to whatever default you guys think is reasonable.

[screenshot: username directories at the top level of s3://braingeneersdev]

@atspaeth (Member, Author)

As for $S3_USER, unfortunately it's definitely not guaranteed to be set. The idea was for each user's calls to these memoized functions to be saved under their own user prefix, so it's supposed to fail if the user hasn't set it, since there's no general way to discover that prefix.

@DailyDreaming (Member)

Ah, sorry about the usernames. I was looking in s3://braingeneers and not s3://braingeneersdev, I now realize.

@DailyDreaming (Member)

> As for $S3_USER, unfortunately it's definitely not guaranteed to be set. The idea was for each user's calls to these memoized functions to be saved under their own user prefix, so it's supposed to fail if the user hasn't set it, since there's no general way to discover that prefix.

If that's the case, I would recommend something like:

S3_USER = os.environ.get('S3_USER', 'common')


def clear_location(self, location):
    # Recursive delete.
    wr.s3.delete_objects(glob.escape(location))
@DailyDreaming (Member)

I make this suggestion in a following comment as well, but I would enforce a cache-dir prefix here and raise an error when trying to delete outside of the cache dir; the suggested prefix is s3://braingeneersdev/cache/. A user should not be able to accidentally clear non-cache data if they naively set, for example, s3://braingeneersdev/ephys/ as their cache dir.

@atspaeth (Member, Author)

That's a good point, but it's also a lot less dangerous than it looks because of the way joblib uses the cache dir you give it. A call like this:

@memoize("s3://braingeneers/")
def bar(baz): ...

in the module foo.py has its actual cache files stored under s3://braingeneers/joblib/foo/bar/, so only things under that whole prefix get deleted when the cache is cleared.

@DailyDreaming (Member)

Perfect. In that case, I would still check for the s3://braingeneers/joblib prefix. It's good to be a little paranoid about recursive deletion, I think.

@atspaeth (Member, Author) commented Sep 19, 2023

Yeah, fair. I'll check that the requested location actually starts with self.location + '/joblib/', and then nothing weird should happen.

(Edit: self.location already includes the /joblib/ part, so the commit implementing this is right and what I originally said in this comment is wrong.)

@atspaeth (Member, Author)

I'm planning to use this to run computations in containers and then analyze the results on my PC. I've been using it for spiking neuronal simulations so far (the function parameters are things like the number and type of neurons, connectivity, etc., and the result is a fairly large SpikeData object containing all the neural activity), and today I'm switching to also using it for some HMM fits.

@davidparks21 pointed out that utils are supposed to be usable even with
only the "minimal" dependencies, and OKed moving `joblib`, `awswrangler`,
and `smart_open` all to "minimal". This also allows removing
`awswrangler` and `smart_open` from the "analysis" dependencies, so
their versions only have to be specified in one place.

@DailyDreaming suggested putting an upper limit on the joblib dependency
as well in case there are breaking changes in the next major version.
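A hedged sketch of how the combined constraint might look in the package metadata (the exact file and field are assumptions):

```python
# Hypothetical setup.py-style excerpt: the floor comes from the earlier
# fix (a keyword argument used here was added in joblib 1.3), and the
# ceiling guards against breaking changes in the next major version.
install_requires = ["joblib>=1.3,<2"]
```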
@DailyDreaming (Member)

> I'm planning to use this to run computations in containers and then analyze the results on my PC. I've been using it for spiking neuronal simulations so far (the function parameters are things like the number and type of neurons, connectivity, etc., and the result is a fairly large SpikeData object containing all the neural activity), and today I'm switching to also using it for some HMM fits.

Thanks for explaining. This might be a source of large amounts of data in need of clean up in the future, so I wanted to understand how cosmopolitan the usage might be.

@atspaeth (Member, Author) commented Sep 19, 2023

> This might be a source of large amounts of data in need of clean up in the future

Yeah, I was worried about this too. joblib tries to support an LRU eviction policy where you can limit the size of a cache, but S3 doesn't seem to support access time natively, so I think I'd have to do something weird involving adding custom metadata attributes to every file created with this method and updating them on load. I'm 95% sure it's possible but didn't want to have to deal with it yet :)
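For the record, one way that idea could look (an illustrative sketch only; the bucket/key handling and metadata name are assumptions):

```python
import datetime
import boto3

s3 = boto3.client("s3")

def touch(bucket, key):
    # S3 has no native access time, so keep a custom "last-accessed"
    # timestamp in the object's metadata, and refresh it on every cache
    # read by copying the object onto itself with replaced metadata.
    s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"last-accessed": datetime.datetime.utcnow().isoformat()},
        MetadataDirective="REPLACE",
    )
```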

@DailyDreaming
Copy link
Member

> Yeah, I was worried about this too. joblib tries to support an LRU eviction policy where you can limit the size of a cache, but S3 doesn't seem to support access time natively, so I think I'd have to do something weird involving adding custom metadata attributes to every file created with this method and updating them on load. I'm 95% sure it's possible but didn't want to have to deal with it yet :)

Completely understandable. It shouldn't be an issue for a while in any case. :)

Add a guard in `S3StoreBackend.clear_location()` checking that the
location to delete starts with `self.location`, so it's less likely to
accidentally delete some other directory.
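A sketch of what that guard could look like (reusing the awswrangler-based delete from the reviewed snippet above; the surrounding class is abridged):

```python
import glob
import awswrangler as wr

class S3StoreBackend:
    location: str  # cache prefix, already including the /joblib/ part

    def clear_location(self, location):
        # Guard: refuse to delete anything outside this backend's own
        # cache prefix before performing the recursive delete.
        if not location.startswith(self.location):
            raise ValueError(
                f"{location!r} is outside the cache at {self.location!r}"
            )
        wr.s3.delete_objects(glob.escape(location))
```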
@atspaeth merged commit 80761b7 into master on Sep 20, 2023
@atspaeth deleted the memoize_s3 branch on Sep 20, 2023