This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Add satisfiability check for case variants #1079

Merged
merged 14 commits into from
Sep 18, 2017

sdboyer
Member

@sdboyer sdboyer commented Aug 29, 2017

What does this do / why do we need it?

This introduces a new satisfiability check in the solver that ensures we don't have any import paths that vary only by case. It's actually really just ProjectRoots, not import paths as a whole, because internal package paths are implicitly verified to be the right case by the package existence checker - that's grounded in the reality of a case-sensitive comparison against what's been read from disk and built into the PackageTree.

The effect of this is that dep will only allow one case-variant of any given import path in a solution/a Gopkg.lock. It will search until it finds combinations of versions of projects that maintain this invariant - as well as all other satisfiability criteria, like version constraints - or fail out with an informative message if no such combination exists.

What should your reviewer look out for in this PR?

still need a handful more tests to cover the combinations

Do you need help or clarification on anything?

Which issue(s) does this PR fix?

several, at least.

fixes #433
fixes #797
fixes #806

}

func (s *selection) popDep(id ProjectIdentifier) (dep dependency) {
	deps := s.deps[id.ProjectRoot]
	dep, s.deps[id.ProjectRoot] = deps[len(deps)-1], deps[:len(deps)-1]

	prlist := s.prLenMap[len(id.ProjectRoot)]
	s.prLenMap[len(id.ProjectRoot)] = prlist[:len(prlist)-1]
Collaborator

What is this supposed to do?

Member Author

is the question more about what prLenMap does in general, or just this specific segment here?

it's popping the last element off the slice of project roots in the prLenMap. it's the dual of the append we do up in *selection.pushDep().
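
A minimal sketch of that push/pop pairing, using simplified stand-in types rather than the real gps ones: the `selection` here holds only a length-keyed map, project roots are plain strings, and `pushRoot`/`popRoot` are hypothetical names.

```go
package main

import "fmt"

// selection is a simplified stand-in: it tracks only project roots,
// bucketed by the byte length of the root.
type selection struct {
	prLenMap map[int][]string
}

// pushRoot appends root to the bucket for its length.
func (s *selection) pushRoot(root string) {
	s.prLenMap[len(root)] = append(s.prLenMap[len(root)], root)
}

// popRoot undoes the most recent pushRoot for a root of this length by
// reslicing the bucket so that its last element is dropped.
func (s *selection) popRoot(root string) string {
	prlist := s.prLenMap[len(root)]
	last := prlist[len(prlist)-1]
	s.prLenMap[len(root)] = prlist[:len(prlist)-1]
	return last
}

func main() {
	s := &selection{prLenMap: make(map[int][]string)}
	s.pushRoot("github.com/foo/bar")
	s.pushRoot("github.com/Foo/bar")
	fmt.Println(s.popRoot("github.com/Foo/bar"))
	fmt.Println(s.prLenMap[len("github.com/foo/bar")])
}
```

The pop never looks at which element it removes, so it is only correct if pushes and pops happen in strict stack order.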

Collaborator

Makes total sense now. Looks like I forgot Go slice semantics. 😛

}

// TODO(sdboyer) bug here if it's possible that strings.ToLower() could
// change the length of the string
Collaborator

Will dropping in utf8.RuneCountInString() for len() fix this?

What if the map was keyed on the lowercase string rather than their length or rune-count?

Member Author

i believe i considered that approach initially, then dismissed it. lemme reconstruct my thought process...

Collaborator

@jmank88 jmank88 Aug 30, 2017

Were you betting on len() being a shortcut/heuristic to do fewer ToLower calls overall?
I was thinking that if you can't use len() (O(1)), and are forced to scan each one to count the runes anyways (O(n)), then ToLower() isn't any more complex at least. But there might be other things to consider, or constant factors dominating.
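
To make the concern in the TODO concrete, here is a small sketch (the helper name `lowerLengths` is made up for illustration). The Kelvin sign U+212A is three bytes in UTF-8 and lowercases to the one-byte ASCII `k`, so `len()` changes under `strings.ToLower()` while the rune count does not:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// lowerLengths returns the byte length and rune count of s before and
// after strings.ToLower.
func lowerLengths(s string) (origBytes, lowBytes, origRunes, lowRunes int) {
	low := strings.ToLower(s)
	return len(s), len(low), utf8.RuneCountInString(s), utf8.RuneCountInString(low)
}

func main() {
	// U+212A KELVIN SIGN lowercases to plain ASCII 'k'.
	ob, lb, oRunes, lRunes := lowerLengths("\u212A")
	fmt.Printf("bytes: %d -> %d, runes: %d -> %d\n", ob, lb, oRunes, lRunes)
}
```

So a map keyed on byte length can put two case variants in different buckets, which is the bug the TODO worries about.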

Member Author

i can't reconstruct the original thought process, but it's fine, because i just bit the bullet and dug into how the toolchain does this.

They have a special ToFold() helper, which generates a stable, folded representation of a string that we can key the map on. at least, i guess that's what it does - i thought this was harder. but, we're at least sorta covered here, because the compiler is the ultimate arbiter of what's acceptable here. of course, i'm relying on current implementation of just one compiler, not a spec, and maybe case-insensitive filesystems will disagree about this definition of case equivalence...but now we're in the domain of some truly far-fetched cases, so whatever.

ToFold() is still O(n) even in its fast case, but we no longer need the slice on the other side, so we have fewer checks to do on the other side of the lookup. So, likely still slower than the O(1) len(), but at least we get something back. Either way, it's worth it to not have to care about.
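
A rough sketch of what keying on a folded form could look like. The `fold` function below is modeled on the approach described here (an ASCII fast path, then `unicode.SimpleFold` to pick one representative per rune); it is not copied from the toolchain or from this PR, and `foldIndex` and `add` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// fold returns a form of s such that strings differing only by Unicode
// simple case folding produce the same result.
func fold(s string) string {
	// Fast path: ASCII with no upper-case letters is already folded.
	needsWork := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf || ('A' <= c && c <= 'Z') {
			needsWork = true
			break
		}
	}
	if !needsWork {
		return s
	}

	var b strings.Builder
	for _, r := range s {
		// SimpleFold steps through the runes equivalent to r in
		// increasing order and then wraps around; stopping at the
		// wrap lands on the smallest member of the set.
		for {
			r0 := r
			r = unicode.SimpleFold(r0)
			if r <= r0 {
				break
			}
		}
		// Prefer a-z over A-Z so the fast path above stays valid.
		if 'A' <= r && r <= 'Z' {
			r += 'a' - 'A'
		}
		b.WriteRune(r)
	}
	return b.String()
}

// foldIndex maps a folded project root to the first casing seen for it.
type foldIndex map[string]string

// add records root. If a different case variant already holds the
// folded key, it returns that variant and false.
func (idx foldIndex) add(root string) (existing string, ok bool) {
	key := fold(root)
	if prev, found := idx[key]; found && prev != root {
		return prev, false
	}
	idx[key] = root
	return root, true
}

func main() {
	idx := foldIndex{}
	idx.add("github.com/sirupsen/logrus")
	prev, ok := idx.add("github.com/Sirupsen/logrus")
	fmt.Println(prev, ok)
}
```

With one map entry per folded key there is no slice to scan after the lookup, which is the trade described above: an O(n) fold up front in exchange for a single comparison afterwards.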

Member Author

lololol apparently i found toFold() before - #433 (comment) - and forgot about it when i actually sat down to implement this

@sdboyer
Member Author

sdboyer commented Aug 31, 2017

ugh lol this is metastasizing a bit (as solver checks kinda tend to) - i realized that there are more cases to address slightly differently:

  1. when there are case-variant imports within a single project
  2. when the addressing project uses a different casing than the project uses when it references itself. this is actually the case for logrus - which self-references as github.com/sirupsen/logrus - but people rarely run into it, because not many people use its syslog subpackage.

i'm gonna punt on the first one, because it's kinda vanishingly unlikely to happen right now, and i can't think of a way to do it without inducing a relatively expensive validation check in the main solving loop. a good solution down the road might be to precalculate error conditions like that in ListPackages(), as that would allow them to be done once and cached, and referenced later as needed.

"as needed" may also include up in dep, outside of the solver, for case-variant imports within the root project. alternately, and probably preferably, it might also be checked as part of gps.Prepare().

the second item, though, does need to be done now, and kinda needs its own failure type. it needs to be clearly opinionated about the fact that the dependers are doing the wrong thing, whereas caseMismatchFailure is more ¯\_(ツ)_/¯, because it doesn't know which one is "right", as it's operating on a "first past the post" basis.
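
One way to picture the difference between the two failure kinds, as a sketch only: the type names `ambiguousCase` and `wrongCase`, the `checkCase` function and its parameters are all hypothetical, and `strings.EqualFold` stands in for whatever folding the solver actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// ambiguousCase: two dependers disagree on casing and nothing says which
// one is right, so whichever was seen first wins.
type ambiguousCase struct{ established, requested string }

func (e ambiguousCase) Error() string {
	return fmt.Sprintf("case-only variation: %q requested, but %q already established", e.requested, e.established)
}

// wrongCase: the project's own imports show its canonical casing, so the
// depender is unambiguously the one at fault.
type wrongCase struct{ canonical, requested string }

func (e wrongCase) Error() string {
	return fmt.Sprintf("%q is addressed with the wrong case; the project refers to itself as %q", e.requested, e.canonical)
}

// caseVariant reports whether a and b differ, but only by case.
func caseVariant(a, b string) bool {
	return a != b && strings.EqualFold(a, b)
}

// checkCase checks a requested project root against the casing already
// established in the selection and the casing the project uses for its
// own packages. Empty strings mean "not known".
func checkCase(requested, established, selfRef string) error {
	if selfRef != "" && caseVariant(requested, selfRef) {
		return wrongCase{canonical: selfRef, requested: requested}
	}
	if established != "" && caseVariant(requested, established) {
		return ambiguousCase{established: established, requested: requested}
	}
	return nil
}

func main() {
	fmt.Println(checkCase("github.com/Sirupsen/logrus", "", "github.com/sirupsen/logrus"))
	fmt.Println(checkCase("Bar", "bar", ""))
	fmt.Println(checkCase("bar", "bar", ""))
}
```

The self-reference check runs first because it is the more informative result: it can name the correct casing, while the other one can only report a disagreement.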

@sdboyer
Member Author

sdboyer commented Sep 2, 2017

ok, this has grown more. in the course of getting tests for the alternate failure type (also now implemented) where we can unambiguously infer canonical import path information from the internal import path patterns (num 2 above), i ran into difficulties getting the harnesses to behave. so, now the gps solver testing harness has effectively been made case-insensitive for the "root portion" (up to the first /) of import paths.
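
For illustration, treating only the root portion as case-insensitive could look roughly like this. `foldRootPortion` is a hypothetical helper, not the harness code, and it uses `strings.ToLower` as a stand-in on the assumption that the harness's fixture names are ASCII.

```go
package main

import (
	"fmt"
	"strings"
)

// foldRootPortion lower-cases everything before the first "/" and leaves
// the rest of the import path untouched.
func foldRootPortion(importPath string) string {
	i := strings.IndexByte(importPath, '/')
	if i < 0 {
		// No slash: the whole path is the root portion.
		return strings.ToLower(importPath)
	}
	return strings.ToLower(importPath[:i]) + importPath[i:]
}

func main() {
	fmt.Println(foldRootPortion("Bar/Sub/Pkg"))
	fmt.Println(foldRootPortion("Bar"))
}
```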

this is a significant choice, because it's now basically just dumb to not follow suit in the real SourceManager. at that point, though, we're really instituting global policy, saying that import paths essentially have the same rules as HFS - case-insensitive+case-preserving. that is effectively the policy that the compiler applies, so i think everything might be 👌 with that - but i definitely need to ponder a little more. also gonna talk to the Go team about it.

in looking at all that, i also realized that we have a nasty bug - on a case-insensitive filesystem, having case-variant project roots results in a situation where there are multiple sourceGateways for a single repo on disk. this can result in horrible undefined behavior, and it has to be fixed.

@sdboyer sdboyer changed the title [WIP] Add satisfiability check for case variants Add satisfiability check for case variants Sep 4, 2017
}

f := func(name string, pi1, pi2 ProjectIdentifier) {
	t.Run(name, func(t *testing.T) {
Collaborator

NBD, but for cases like this, sometimes I like to define a struct for the test state, with a run method:

type testName struct {...}
func (testName) run(*testing.T) {...}

Then instead of:

f("folded first", folded, casevar1)
f("folded second", casevar1, folded)
f("both unfolded", casevar1, casevar2)

You get:

t.Run("folded first", testName{folded, casevar1}.run)
t.Run("folded second", testName{casevar1, folded}.run)
t.Run("both unfolded", testName{casevar1, casevar2}.run)

Which reads nicely, and avoids the multi-level closure and indentation hell that sometimes comes with these sorts of tests.

Member Author

mm yes, good point, i am being lazy with these. table-based declaration of them is more generally standard and does look better. i keep seeing your changes come in and thinking "i should remember to declare like that 😛 "

@sdboyer
Member Author

sdboyer commented Sep 5, 2017

i'm backing away from the HFS analogy, as it isn't great. at the very least, it's misleading, as the operations we perform aren't terribly analogous to the ones performed by filesystems.

in any case, i think we're now at a stable spot with this. the core checks are in place in the solver itself, both the testing harness and the SourceManager are updated to treat project roots as case insensitive, and there are nine new test cases that cover what seem like (read: based on intuition, not a rigorous combinatorial matrix) the important combinations of this new satisfiability check with existing system properties.

in general, the new satisfiability check is just a big win. there are really no cases that were working before that it cuts out - it just prevents dep from accepting solutions which already weren't going to result in a compilable build. the only drawback is the performance cost: we now have to perform case folding on each external import root at each step in the solver. we don't have benchmarks (there's another TODO #896) to know the actual impact of that, and it's all masked now anyway by the constant-factor costs of network and disk interaction.

i'm slightly less bullish on treating root portions of paths as case-insensitive. we are, of course, within the bounds of what the compiler enforces by doing it, but we're also reinterpreting that logic in a different domain (local disk vs. network). but this PR is effectively deciding that all code hosts treat the root portion of the code they host as case-insensitive. even if it's considered bad practice to vary only by case, i can't imagine all code hosts actually enforce this as a rule in the way that we now effectively assume they do.

the reasons to do this anyway, despite that risk, are:

  1. it does seem very unlikely that names differentiated only by case NEED to happen, and making it a rule in our tool will probably mostly have the effect of further disincentivizing the bad practice.
  2. if and when we encounter a real problem with this, changing dep's behavior will not be terribly costly, or change the meaning of historical Gopkg files.

as the code is currently written, the risks are:

  1. we are using the case-folded name when we write to disk. that counts as "storing" a case-folded representation of a string, and the unicode docs specifically indicate that that's not a good idea. (counter: import paths are likely to be small, and far more likely to be ASCII-only than normal human text - yes, i know and am concerned that i'm recommitting the original sin of text encodings there - so the likelihood of encountering any of the possible bogeymen related to case folding is near-nil)
  2. the design goal here is that repos are stored in GOPATH/pkg/dep/sources using case-folded paths. any dep run accessing any case variant of those paths, from any project, will end up using the case-folded version. that's a Good Thing, and the goal, when the host treats these components as case insensitive. but, if there ARE two distinct projects that vary only by case, then whichever of those happens to be accessed first will end up being the permanent resident under GOPATH/pkg/dep/sources on that machine, and the behavior of the system will be undefined (counter: if/when we actually encounter this, there are checks we could conceivably add to at least warn that it's the case. it seems premature to add them before we are sure this is a realistic problem, though.)
  3. we need to be cognizant of the much more common case that a) a host is case sensitive, b) there is NOT a case-variant source for one of its repos, and c) a case-variant import path is provided. if GOPATH/pkg/dep/sources is already populated locally, then that case-variant import path will seem to be OK (because we have a repo for it already), even though there is no corresponding upstream. all of our maybeSource reach out to the internet right now to check upstream existence, so i don't think this can be a problem right now, but when we get around to severing from the upstream a bit, we'll need to be careful.

@ChimeraCoder

if there ARE two distinct projects that vary only by case

I'm very curious if this has ever happened intentionally. I can't think of any example, and so far, neither has Twitter. Also, if what dmorsing is saying is correct, this situation will already cause the Go compiler to fail to build, which would mean it's not necessary for dep to solve it. (In other words, if it's a build failure, it's unlikely to be the case with any packages that exist today, and if it happens tomorrow for a new package, well, even without dep, it'd still provide a build failure). We should confirm that, though!

As for the case folding, which is what you pinged me about: My reading of the Unicode docs implies that this isn't a terrible idea (albeit not recommended):

Case mapping or case conversion is a process whereby strings are converted to a particular form—uppercase, lowercase, or titlecase—possibly for display to the user. Case folding is mostly used for caseless comparison of text, such as identifiers in a computer program, rather than actual text transformation. Case folding in Unicode is primarily based on the lowercase mapping, but includes additional changes to the source text to help make it language-insensitive and consistent. As a result, case-folded text should be used solely for internal processing and generally should not be stored or displayed to the end user

In this case, the folded "text" that we're storing is actually the key in a key/value store, whereas the docs seem to be written with the "value" mostly in mind. (For example, according to the Unicode docs, we shouldn't fold the source code itself and store that - even aside from the fact that it'd completely break compilation in Go).

That said, I'm not wholly convinced that the folding is necessary on-disk. Presumably the contents would be identical in both directories. While that's a minor waste of space, vendoring itself is a solution that only makes sense if we treat disk space as a resource abundant enough not to require minor optimizations. And from a version control perspective, git will deduplicate the files, so the amount of additional space needed in the repository is negligible. (I'm less familiar with other version control systems, to be honest, but I think most should handle this reasonably).

As you mentioned, the logic for the solver (ensuring that these versions are treated identically) has to exist anyway, so other than saving a few bytes on disk, I don't see a strong benefit to dropping that piece of information. (That is, the information of which casing was used to access a library at the time it was vendored by dep).

@sdboyer
Member Author

sdboyer commented Sep 5, 2017

awesome, thanks for taking the time on this! 😄

this situation will already cause the Go compiler to fail to build, which would mean it's not necessary for dep to solve it. (In other words, if it's a build failure, it's unlikely to be the case with any packages that exist today, and if it happens tomorrow for a new package, well, even without dep, it'd still provide a build failure). We should confirm that, though!

he is indeed correct - that's what i reference earlier on in here as being the rule in the compiler. this is where that check is. and this comment has a user running into it: #433 (comment).

which would mean it's not necessary for dep to solve it.

it's kinda the other way around, actually. in order to produce a depgraph that the compiler will find acceptable, this PR introduces checks that make it impossible for case-only differences in import paths to exist in any solution that it finds. (that was the original goal of this PR; everything related to filesystems and storage is actually just a knock-on effect of addressing this original problem in the solver). otherwise, dep will pick out a set of dependencies that won't actually work, and won't even be writable on a case-insensitive filesystem (e.g. #797).

people end up having to resolve this crap manually, which has been an arduously difficult process for a number of users already. these changes can't fix it for them, but it at least tells people which of their dependencies are using problematic imports and need to be fixed, as well as attempting to find versions of those dependencies that don't have a problem.

i should have an example of what the -v output looks like, or something, in order to make this clearer.

In this case, the folded "text" that we're storing is actually the key in a key/value store, whereas the docs seem to be written with the "value" mostly in mind.

coooool cool good, ok. yes, that makes a lot of sense, and assuages my concerns.

That said, I'm not wholly convinced that the folding is necessary on-disk.

it may well not be. i opted for this approach mostly because uniformity seemed beneficial. but, some things to clarify:

Presumably the contents would be identical in both directories. While that's a minor waste of space, vendoring itself is a solution that only makes sense if we treat disk space as a resource abundant enough not to require minor optimizations.

indeed, i think the "avoiding waste" is not a terribly good argument for doing the on-disk folding. the crucial requirement here is rather that we strictly control there being only a single object (a sourceGateway) that manages each physical repository in-memory at any given time - this is how we prevent multiple e.g. git operations from running at once. prior to the changes in the last few commits here, on a case-insensitive filesystem, github.com/Sirupsen/logrus and github.com/sirupsen/logrus would each get their own object in memory, but both would be pointing to the same filesystem node. that absolutely can't happen.

but doing that doesn't necessarily entail using a folded case on the filesystem itself - only that in the in-memory maps, we keep the folded case on hand for lookup purposes, so that subsequent calls to fetch the sourceGateway can correctly converge on the right sourceGateway for the family of case variants that share the same folded form.
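
Roughly, the in-memory convergence could be sketched like this. Everything here is a hypothetical stand-in: `gateway` and `registry` are not the real sourceGateway types, and `strings.ToLower` substitutes for a proper case fold.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// gateway stands in for the single object that manages one repository.
type gateway struct {
	name string // the casing that first created the gateway, kept as-is
}

// registry hands out gateways, keyed in memory on the folded name.
type registry struct {
	mu       sync.Mutex
	byFolded map[string]*gateway
}

func newRegistry() *registry {
	return &registry{byFolded: make(map[string]*gateway)}
}

// get returns the gateway for name, creating it on first use. All case
// variants of a name converge on the same gateway.
func (r *registry) get(name string) *gateway {
	key := strings.ToLower(name) // stand-in for a real case fold
	r.mu.Lock()
	defer r.mu.Unlock()
	if g, ok := r.byFolded[key]; ok {
		return g
	}
	g := &gateway{name: name}
	r.byFolded[key] = g
	return g
}

func main() {
	r := newRegistry()
	a := r.get("github.com/Sirupsen/logrus")
	b := r.get("github.com/sirupsen/logrus")
	fmt.Println(a == b, a.name)
}
```

Only the lookup key is folded; the gateway keeps the original casing, so nothing about what is written to disk has to change for the variants to converge.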

the network activity is a tad more concerning than the disk usage, but still probably negligible. however, it may become more of a pain in the future - e.g., under #431, it may become useful for people to forcefully clear the caches for a particular dependency. (hopefully not, but...) in such cases, it seems to me that if we generally follow the pattern of keying on the case-folded form everywhere, we might avoid gotchas in that arena.

(That is, the information of which casing was used to access a library at the time it was vendored by dep).

just to be totally clear, that's not actually the moment we're talking about here. what appears in your Gopkg.lock and vendor are unaffected by this change - all original casings are preserved there. rather, the moment in question is when dep first sees a given import path. the initial clone, into e.g. GOPATH/pkg/dep/sources/https---github.com-Sirupsen-logrus - will determine what directory the system operates from in all future dep runs. if we fold the case in the "key" there, then what we're gaining is a guarantee that on a case-sensitive filesystem, we'll never unnecessarily end up cloning and managing GOPATH/pkg/dep/sources/https---github.com-Sirupsen-logrus after GOPATH/pkg/dep/sources/https---github.com-sirupsen-logrus was already created.

¯\_(ツ)_/¯

@sdboyer
Member Author

sdboyer commented Sep 5, 2017

here's some sample output from the tests introduced in the PR:

[gps] go test -v -run=TestBimodalSolves/case_variations_across_multiple
=== RUN   TestBimodalSolves
=== RUN   TestBimodalSolves/case_variations_across_multiple_deps
--- PASS: TestBimodalSolves (0.00s)
    --- PASS: TestBimodalSolves/case_variations_across_multiple_deps (0.00s)
    	writer.go:27: Root project is "root"
    	writer.go:27:  1 transitively valid internal packages
    	writer.go:27:  2 external packages imported from 2 projects
    	writer.go:27: (0)   ✓ select (root)
    	writer.go:27: (1)	? attempt bar with 1 pkgs; 1 versions to try
    	writer.go:27: (1)	    try bar@1.0.0
    	writer.go:27: (1)	✓ select bar@1.0.0 w/1 pkgs
    	writer.go:27: (2)	? attempt foo with 1 pkgs; 1 versions to try
    	writer.go:27: (2)	    try foo@1.0.0
    	writer.go:27: (2)	✓ select foo@1.0.0 w/1 pkgs
    	writer.go:27: (3)	? attempt baz with 1 pkgs; 1 versions to try
    	writer.go:27: (3)	    try baz@1.0.0
    	writer.go:27: (4)	✗   case-only variation in dependency on "Bar"; "bar" already established by:
    	writer.go:27: (4)	  (root)
    	writer.go:27: (4)	  foo@1.0.0
    	writer.go:27: (3)	  ← no more versions of baz to try; begin backtrack
    	writer.go:27: (2)	← backtrack: no more versions of foo to try
    	writer.go:27: (2)	← backtrack: no more versions of foo to try
    	writer.go:27: (1)	← backtrack: no more versions of bar to try
    	writer.go:27:   ✗ solving failed
    	writer.go:27: Solver wall times by segment:
    	writer.go:27:             new-atom: 279.154µs
    	writer.go:27:              satisfy:  239.19µs
    	writer.go:27:          select-root:  118.18µs
    	writer.go:27:          select-atom:  81.488µs
    	writer.go:27:            backtrack:  22.041µs
    	writer.go:27:             unselect:  16.365µs
    	writer.go:27:               b-gmal:   7.238µs
    	writer.go:27:   b-deduce-proj-root:   7.202µs
    	writer.go:27:                other:   3.192µs
    	writer.go:27:      b-source-exists:     604ns
    	writer.go:27:   TOTAL: 774.654µs

it's the case-only variation failure bit that's been added here.

there's a more verbose failure message that gets dumped than the one in the tracer, but...well, yeah, suggestions on the wording of these failure messages are also welcome 😄

@sdboyer sdboyer closed this Sep 10, 2017
@sdboyer sdboyer reopened this Sep 10, 2017
This lifts the exact folding algorithm and use pattern followed by the
toolchain up into gps. Not only does it solve the strings.ToLower()
inadequacy, but it means we're using the exact same logic the
go compiler does to decide this same question.

May or may not end up using this right away, but it'll be in place for
when we have the slightly stronger failure case of a project being
addressed with an incorrect case, as indicated by the project's way
of referencing its own packages.

This effectively makes them case-insensitive, case-preserving.
@sdboyer
Member Author

sdboyer commented Sep 11, 2017

the rabbit hole kinda keeps going down with this one. one thing, for example, that i need to look at - must import comments be byte-literal matches, or do they case fold as well? this could end up mattering, as these casing rules start spreading their tendrils through dep :(

@sdboyer
Member Author

sdboyer commented Sep 11, 2017

ugh...actually, so, that latest change just unconditionally always operates on the folded form of URLs when interacting with remote services. that's truly assuming that they're case-insensitive, rather than the weaker case where they can be case-sensitive, but still disallow case-only variations in the data they host. the latter seems much safer.

need to make another change tomorrow to accommodate that.

Keeping track of what maps to what in the sourceGateway setup can be
really tricky with all the combinations; in the event of failures in
this test, this will show the mapping tables, which helps a lot with
understanding the actual final state.

But, still preserve the rule that we record the canonical folded URL in
memory, so that we can have non-canonical inputs come in first and still
converge with subsequent canonical, or other-case-variant forms later.
@sdboyer
Member Author

sdboyer commented Sep 18, 2017

OK, those issues are now ameliorated - we don't case-fold what we write to disk, but we do case-fold in memory. This being case sensitivity, I imagine there's still gremlins running around somewhere, but I think they've been banished sufficiently far underground that we won't hear from them for a while.

@sdboyer sdboyer merged commit 5aa4ffe into golang:master Sep 18, 2017
@sdboyer sdboyer mentioned this pull request Sep 19, 2017
var buf bytes.Buffer

str := "Could not introduce %s due to a case-only variation: it depends on %q, but %q was already established as the case variant for that project root by the following other dependers:\n"
fmt.Fprintf(&buf, str, e.goal.dep.Ident.ProjectRoot, e.current, a2vs(e.goal.depender))
Contributor

The order of arguments here is wrong. They should match the arguments to the previous fmt.Sprintf call, with the a2vs first.
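
For reference, a standalone sketch of the ordering being suggested, as I read the format string: the first verb names the depender being introduced, the second the casing it asks for, the third the casing already established. The function `caseMismatchMessage` and its plain string parameters are hypothetical stand-ins for `a2vs(e.goal.depender)`, `e.goal.dep.Ident.ProjectRoot` and `e.current`.

```go
package main

import (
	"bytes"
	"fmt"
)

// caseMismatchMessage formats the failure header with the arguments in
// the same order as the verbs: depender, requested root, established root.
func caseMismatchMessage(depender, requested, established string) string {
	var buf bytes.Buffer
	str := "Could not introduce %s due to a case-only variation: it depends on %q, but %q was already established as the case variant for that project root by the following other dependers:\n"
	fmt.Fprintf(&buf, str, depender, requested, established)
	return buf.String()
}

func main() {
	fmt.Print(caseMismatchMessage("baz@1.0.0", "Bar", "bar"))
}
```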

Member Author

this is already merged, so a PR fixing it is preferred :)

Contributor

Yeah, I ran into this deep into a concerted effort to port a project from govendor at the end of the work week, so I figured a quick note was better than nothing.

Member Author

for sure, better to get the note down when you have the eureka moment. i'll take a PR whenever you find some time 😄

6 participants