-
-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve pex resolver caching for local dist resolves #453
Improve pex resolver caching for local dist resolves #453
Conversation
…ed with sha1 of dist contents, store dist there.
…ed version of a bdist. Streamline first integration test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only leave my comments and let the owners to approve it.
pex/resolver.py
Outdated
return (hash(x) != hash(cached_wheel_pkg) and | ||
x.name == cached_wheel_pkg.name and | ||
x.raw_version == cached_wheel_pkg.raw_version and | ||
'file://' in x.url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move the hash comparison to the last item so that we have chance to short circulate without the need to run a more expensive hash function.
pex/resolver.py
Outdated
x.name == cached_wheel_pkg.name and | ||
x.raw_version == cached_wheel_pkg.raw_version and | ||
'file://' in x.url) | ||
for pkg in all_packages: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return next(p for p in all_packages if predicate(p), None)
next should be faster than the for loop.
pex/resolver.py
Outdated
dist_filename, self.get_cache_key(dist.location))) | ||
safe_mkdir(cached_location) | ||
|
||
target = os.path.join(cached_location, dist_filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could do something like
for location in [cached_location, self.ensure_current_dir()]:
target = os.path.join(location, dist_filename)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me :) Just one comment on the cache key
(Will also wait for someone who knows the code better, like @kwlzn, to have a look)
@@ -247,16 +263,27 @@ def filter_packages_by_ttl(cls, packages, ttl, now=None): | |||
return [package for package in packages | |||
if package.remote or package.local and (now - os.path.getmtime(package.local_path)) < ttl] | |||
|
|||
@classmethod | |||
def get_cache_key(cls, dist_location): | |||
return os.stat(dist_location).st_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than size, can we use something more robust like a content digest?
Looking at django 1.6 (which is ~7MB) which you profiled:
tmp$ time md5 Django-1.6.11-py2.py3-none-any.whl
MD5 (Django-1.6.11-py2.py3-none-any.whl) = e43d8c4147a821205b554aebc1da62d5
real 0m0.025s
user 0m0.019s
sys 0m0.008s
tmp$ time shasum -a 1 Django-1.6.11-py2.py3-none-any.whl
0d7d9071158294930ae86b776eb9ff9011eb1d7b Django-1.6.11-py2.py3-none-any.whl
real 0m0.076s
user 0m0.047s
sys 0m0.021s
tmp$ time shasum -a 256 Django-1.6.11-py2.py3-none-any.whl
6f2cc848b2b72adf53a6f3f21be049c477e82c408bce7cedb57efeb0984bde24 Django-1.6.11-py2.py3-none-any.whl
real 0m0.100s
user 0m0.062s
sys 0m0.023s
I'm definitely happy to add 50ms per 7M dependency to pex building. I could also live with an extra 200ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look! I agree that a hash digest would certainly be more robust. I changed it to size purely to squeeze out some perf, but it may make sense to switch back. I'll wait for some additional feedback on the perf impact.
This was addressed in #6022. Closing. |
Solution for: #452
Problem:
Currently, the pex resolver handles resolving dists and caches them using the name of the wheel in the pex resolver cache. This can be problematic if two locally-built (and locally resolved via a find-links fetcher) setup.py-based Python projects have the same name and version info in their respective setup.py files. This could lead to undefined behavior in the event that there is a collision in the pex resolver cache.
Solution:
Use os.stat file size in bytes to create cache keys for wheel files in the resolver cache based on their contents. Create subdirectories in the cache dir based on these cache keys and store the wheel file in its unique directory. Create a
current
directory in the cache dir for crawling the latest packages, and compare the cache key of the latest version of the package to the cache key of the first non-matching local package found by theResolver
package iterator. (This is key for managing performance impact; if we do not filter out packages that do not use thefile://
protocol, the current implementation will parse and filter all sdists and bdists found on pypi and remote links, and these remote links will not/should not have the version content conflict as described by the core problem this PR aims to solve). If the keys do not match, invalidate the latest matching wheel in the cache, and cache the new wheel and use it to resolve.Result:
Wheels with the same name and version but different contents will by treated differently by the pex resolver cache. This implies that sdists for local Python projects (crawled from a find-links fetcher to a file system location) will appear differently to the resolver cache when the underlying dist contents change, triggering a cache invalidation when the project contents change but the dist name and version do not.
Performance impacts
Here is some preliminary data on the impact of build time caching performance on requirements of small, medium-small, medium, and large sizes, averaged over 3 runs for each.
Note that runtime caching performance will by completely unaffected by any of these chages. The impact would only be on the time it takes to build a pex.