
Improve pex resolver caching for local dist resolves #453

Conversation


@CMLivingston (Contributor) commented on Mar 2, 2018

Solution for: #452

Problem:

Currently, the pex resolver resolves dists and caches them in the resolver cache keyed by wheel name alone. This is problematic when two locally built, setup.py-based Python projects (resolved locally via a find-links fetcher) share the same name and version in their respective setup.py files: a collision in the resolver cache can lead to undefined behavior.

Solution:

Use the os.stat file size in bytes to create content-based cache keys for wheel files in the resolver cache. Create a subdirectory in the cache dir for each cache key and store the wheel file in its unique directory. Also maintain a current directory in the cache dir for crawling the latest packages, and compare the cache key of the latest version of the package to the cache key of the first non-matching local package found by the Resolver package iterator. (This filtering is key to managing the performance impact: if we did not filter out packages that do not use the file:// protocol, the implementation would parse and filter every sdist and bdist found on PyPI and remote links, and those remote links will not/should not have the version-content conflict this PR aims to solve.) If the keys do not match, invalidate the latest matching wheel in the cache, then cache the new wheel and use it to resolve.
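The key derivation and invalidation check described above can be sketched as follows (a minimal illustration: `get_cache_key` mirrors the PR's size-based key, while `needs_invalidation` is a hypothetical helper name, not code from this PR):

```python
import os


def get_cache_key(dist_location):
    # Size-based cache key, as in this PR: cheap to compute, but
    # collision-prone compared to a content digest (see review below).
    return os.stat(dist_location).st_size


def needs_invalidation(local_wheel_path, cached_wheel_path):
    # A cached wheel with the same name/version must be invalidated
    # when its contents (here approximated by size) have changed.
    return get_cache_key(local_wheel_path) != get_cache_key(cached_wheel_path)
```

Two wheels named identically but built from different project contents would then map to different cache subdirectories, so the stale one can be invalidated rather than silently reused.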

Result:

Wheels with the same name and version but different contents will be treated as distinct by the pex resolver cache. This means that sdists for local Python projects (crawled via a find-links fetcher pointing at a file system location) will appear different to the resolver cache when the underlying dist contents change, triggering a cache invalidation when the project contents change but the dist name and version do not.

Performance impacts

Here is some preliminary data on the build-time caching performance impact, for requirements of small, medium-small, medium, and large size, averaged over 3 runs each.

Pex `CachedResolver#resolve` performance (in seconds, values rounded to 3 decimals):

| Requirement | Master | Master (cached run) | Feature branch | Feature branch (cached run) |
|---|---|---|---|---|
| q | 0.077 | 0.052 | 0.072 | 0.049 |
| django==1.6 | 0.715 | 0.197 | 0.832 | 0.183 |
| pandas | 1.914 | 0.865 | 2.547 | 0.964 |
| tensorflow | 6.242 | 1.949 | 7.797 | 4.383 |

Note that runtime caching performance will be completely unaffected by any of these changes. The impact is only on the time it takes to build a pex.

@CMLivingston changed the title from "Improve pex resolver caching" to "Improve pex resolver caching for local dist resolves" on Mar 2, 2018
@CMLivingston changed the title from "Improve pex resolver caching for local dist resolves" to "(WIP) Improve pex resolver caching for local dist resolves" on Mar 3, 2018
@CMLivingston changed the title from "(WIP) Improve pex resolver caching for local dist resolves" to "Improve pex resolver caching for local dist resolves" on Mar 5, 2018

@UnrememberMe left a comment


I'll just leave my comments and let the owners approve it.

pex/resolver.py Outdated
return (hash(x) != hash(cached_wheel_pkg) and
x.name == cached_wheel_pkg.name and
x.raw_version == cached_wheel_pkg.raw_version and
'file://' in x.url)


Move the hash comparison to the last position so that we have a chance to short-circuit without needing to run the more expensive hash function.

pex/resolver.py Outdated
x.name == cached_wheel_pkg.name and
x.raw_version == cached_wheel_pkg.raw_version and
'file://' in x.url)
for pkg in all_packages:


return next((p for p in all_packages if predicate(p)), None)
next should be faster than the for loop.
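A runnable version of this suggestion (note that next() requires the generator to be wrapped in its own parentheses when a default is supplied, otherwise it is a SyntaxError; `first_matching_package` is a hypothetical helper name):

```python
def first_matching_package(all_packages, predicate):
    # Returns the first package satisfying the predicate, or None.
    # The generator must have its own parentheses when next() is also
    # given a default value.
    return next((p for p in all_packages if predicate(p)), None)
```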

pex/resolver.py Outdated
dist_filename, self.get_cache_key(dist.location)))
safe_mkdir(cached_location)

target = os.path.join(cached_location, dist_filename)


could do something like

for location in [cached_location, self.ensure_current_dir()]:
  target = os.path.join(location, dist_filename)
  ...


@illicitonion (Contributor) left a comment


Looks good to me :) Just one comment on the cache key

(Will also wait for someone who knows the code better, like @kwlzn, to have a look)

@@ -247,16 +263,27 @@ def filter_packages_by_ttl(cls, packages, ttl, now=None):
return [package for package in packages
if package.remote or package.local and (now - os.path.getmtime(package.local_path)) < ttl]

@classmethod
def get_cache_key(cls, dist_location):
return os.stat(dist_location).st_size


Rather than size, can we use something more robust like a content digest?

Looking at Django 1.6 (~7MB), which you profiled:

tmp$ time md5 Django-1.6.11-py2.py3-none-any.whl
MD5 (Django-1.6.11-py2.py3-none-any.whl) = e43d8c4147a821205b554aebc1da62d5

real	0m0.025s
user	0m0.019s
sys	0m0.008s
tmp$ time shasum -a 1 Django-1.6.11-py2.py3-none-any.whl
0d7d9071158294930ae86b776eb9ff9011eb1d7b  Django-1.6.11-py2.py3-none-any.whl

real	0m0.076s
user	0m0.047s
sys	0m0.021s
tmp$ time shasum -a 256 Django-1.6.11-py2.py3-none-any.whl
6f2cc848b2b72adf53a6f3f21be049c477e82c408bce7cedb57efeb0984bde24  Django-1.6.11-py2.py3-none-any.whl

real	0m0.100s
user	0m0.062s
sys	0m0.023s

I'm definitely happy to add 50ms per 7MB dependency to pex building. I could also live with an extra 200ms.
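A digest-based key along the lines the timings above suggest could be sketched like this (hypothetical: `digest_cache_key` is an illustrative name, not code from this PR):

```python
import hashlib


def digest_cache_key(dist_location, algorithm='sha1'):
    # Content-digest replacement for the size-based get_cache_key.
    # Streams the wheel in 64 KiB chunks so large dists are never
    # read into memory at once.
    hasher = hashlib.new(algorithm)
    with open(dist_location, 'rb') as fp:
        for chunk in iter(lambda: fp.read(65536), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Unlike file size, a digest changes whenever any byte of the wheel changes, at the cost of reading the whole file once per key computation.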

@CMLivingston (PR author) replied:


Thanks for taking a look! I agree that a hash digest would certainly be more robust. I changed it to size purely to squeeze out some perf, but it may make sense to switch back. I'll wait for some additional feedback on the perf impact.

@CMLivingston

This was addressed in #6022. Closing.
