
Improve pex resolver caching for local dist resolves #453

Conversation


@CMLivingston (Contributor) commented on Mar 2, 2018

Solution for: #452

Problem:

Currently, the pex resolver resolves dists and caches them in the resolver cache keyed by wheel name alone. This is problematic when two locally built, setup.py-based Python projects (resolved locally via a find-links fetcher) share the same name and version in their respective setup.py files: a collision in the resolver cache can lead to undefined behavior.

Solution:

Use the os.stat file size in bytes to create content-based cache keys for wheel files in the resolver cache. Create a subdirectory in the cache dir for each cache key and store the wheel file in its unique directory. Also maintain a current directory in the cache dir for crawling the latest packages, and compare the cache key of the latest version of the package to the cache key of the first non-matching local package found by the Resolver package iterator. (This filtering is key to managing the performance impact: if we did not filter out packages that do not use the file:// protocol, the implementation would parse and filter every sdist and bdist found on PyPI and remote links, and those remote links will not/should not have the version-content conflict this PR aims to solve.) If the keys do not match, invalidate the latest matching wheel in the cache, then cache the new wheel and use it to resolve.
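The key derivation and invalidation check described above can be sketched as follows (a minimal illustration: `get_cache_key` mirrors the PR's size-based key, while `needs_invalidation` is a hypothetical helper name, not code from this PR):

```python
import os


def get_cache_key(dist_location):
    # Size-based cache key, as in this PR: cheap to compute, but
    # collision-prone compared to a content digest (see review below).
    return os.stat(dist_location).st_size


def needs_invalidation(local_wheel_path, cached_wheel_path):
    # A cached wheel with the same name/version must be invalidated
    # when its contents (here approximated by size) have changed.
    return get_cache_key(local_wheel_path) != get_cache_key(cached_wheel_path)
```

Two wheels named identically but built from different project contents would then map to different cache subdirectories, so the stale one can be invalidated rather than silently reused.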

Result:

Wheels with the same name and version but different contents will be treated as distinct by the pex resolver cache. This means that sdists for local Python projects (crawled via a find-links fetcher pointing at a file system location) will appear different to the resolver cache when the underlying dist contents change, triggering a cache invalidation when the project contents change but the dist name and version do not.

Performance impacts

Here is some preliminary data on the build-time caching performance impact, for requirements of small, medium-small, medium, and large size, averaged over 3 runs each.

Pex `CachedResolver#resolve` performance (in seconds, values rounded to 3 decimals):

| Requirement | Master | Master (cached run) | Feature branch | Feature branch (cached run) |
|---|---|---|---|---|
| q | 0.077 | 0.052 | 0.072 | 0.049 |
| django==1.6 | 0.715 | 0.197 | 0.832 | 0.183 |
| pandas | 1.914 | 0.865 | 2.547 | 0.964 |
| tensorflow | 6.242 | 1.949 | 7.797 | 4.383 |

Note that runtime caching performance will be completely unaffected by any of these changes. The impact is only on the time it takes to build a pex.

@CMLivingston changed the title from "Improve pex resolver caching" to "Improve pex resolver caching for local dist resolves" on Mar 2, 2018
@CMLivingston changed the title from "Improve pex resolver caching for local dist resolves" to "(WIP) Improve pex resolver caching for local dist resolves" on Mar 3, 2018
@CMLivingston changed the title from "(WIP) Improve pex resolver caching for local dist resolves" to "Improve pex resolver caching for local dist resolves" on Mar 5, 2018

@UnrememberMe left a comment


I'll just leave my comments and let the owners approve it.

pex/resolver.py Outdated
return (hash(x) != hash(cached_wheel_pkg) and
x.name == cached_wheel_pkg.name and
x.raw_version == cached_wheel_pkg.raw_version and
'file://' in x.url)


Move the hash comparison to the last position so that we have a chance to short-circuit without needing to run the more expensive hash function.

pex/resolver.py Outdated
x.name == cached_wheel_pkg.name and
x.raw_version == cached_wheel_pkg.raw_version and
'file://' in x.url)
for pkg in all_packages:


return next((p for p in all_packages if predicate(p)), None)
next should be faster than the for loop.
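A runnable version of this suggestion (note that next() requires the generator to be wrapped in its own parentheses when a default is supplied, otherwise it is a SyntaxError; `first_matching_package` is a hypothetical helper name):

```python
def first_matching_package(all_packages, predicate):
    # Returns the first package satisfying the predicate, or None.
    # The generator must have its own parentheses when next() is also
    # given a default value.
    return next((p for p in all_packages if predicate(p)), None)
```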

pex/resolver.py Outdated
dist_filename, self.get_cache_key(dist.location)))
safe_mkdir(cached_location)

target = os.path.join(cached_location, dist_filename)


could do something like

for location in [cached_location, self.ensure_current_dir()]:
  target = os.path.join(location, dist_filename)
  ...


@illicitonion (Contributor) left a comment


Looks good to me :) Just one comment on the cache key

(Will also wait for someone who knows the code better, like @kwlzn, to have a look)

@@ -247,16 +263,27 @@ def filter_packages_by_ttl(cls, packages, ttl, now=None):
return [package for package in packages
if package.remote or package.local and (now - os.path.getmtime(package.local_path)) < ttl]

@classmethod
def get_cache_key(cls, dist_location):
return os.stat(dist_location).st_size


Rather than size, can we use something more robust like a content digest?

Looking at Django 1.6 (~7MB), which you profiled:

tmp$ time md5 Django-1.6.11-py2.py3-none-any.whl
MD5 (Django-1.6.11-py2.py3-none-any.whl) = e43d8c4147a821205b554aebc1da62d5

real	0m0.025s
user	0m0.019s
sys	0m0.008s
tmp$ time shasum -a 1 Django-1.6.11-py2.py3-none-any.whl
0d7d9071158294930ae86b776eb9ff9011eb1d7b  Django-1.6.11-py2.py3-none-any.whl

real	0m0.076s
user	0m0.047s
sys	0m0.021s
tmp$ time shasum -a 256 Django-1.6.11-py2.py3-none-any.whl
6f2cc848b2b72adf53a6f3f21be049c477e82c408bce7cedb57efeb0984bde24  Django-1.6.11-py2.py3-none-any.whl

real	0m0.100s
user	0m0.062s
sys	0m0.023s

I'm definitely happy to add 50ms per 7MB dependency to pex building. I could also live with an extra 200ms.
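A digest-based key along the lines the timings above suggest could be sketched like this (hypothetical: `digest_cache_key` is an illustrative name, not code from this PR):

```python
import hashlib


def digest_cache_key(dist_location, algorithm='sha1'):
    # Content-digest replacement for the size-based get_cache_key.
    # Streams the wheel in 64 KiB chunks so large dists are never
    # read into memory at once.
    hasher = hashlib.new(algorithm)
    with open(dist_location, 'rb') as fp:
        for chunk in iter(lambda: fp.read(65536), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Unlike file size, a digest changes whenever any byte of the wheel changes, at the cost of reading the whole file once per key computation.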

@CMLivingston (PR author) replied:


Thanks for taking a look! I agree that a hash digest would certainly be more robust. I changed it to size purely to squeeze out some perf, but it may make sense to switch back. I'll wait for some additional feedback on the perf impact.

@CMLivingston

This was addressed in #6022. Closing.
