Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling RECORD files with duplicate entries #5913

Open
1 of 3 tasks
uranusjr opened this issue Oct 24, 2018 · 16 comments · May be fixed by #11762
Open
1 of 3 tasks

Handling RECORD files with duplicate entries #5913

uranusjr opened this issue Oct 24, 2018 · 16 comments · May be fixed by #11762
Labels
C: wheel The wheel format and 'pip wheel' command state: needs discussion This needs some more discussion type: enhancement Improvements to functionality

Comments

@uranusjr
Copy link
Member

uranusjr commented Oct 24, 2018

From #5868 (comment) and #5883 (comment)

The problem: RECORD may contain rows poiting to the same path, potentially with different sizes and hashes.

Points to discuss (may be expanded or modified):

  • Does PEP 376 or 427 need to be amended to explicitly prohibit this? If so, how?
    • No, the text is clear. Duplicate path entries make the RECORD invalid.
  • Should (and how should) pip perform some extra steps when such a RECORD file is present?
  • Should dynamically generated rows (e.g. console_scripts) be treated differently if they duplicate existing rows?
@cjerdonek cjerdonek added the state: needs discussion This needs some more discussion label Oct 24, 2018
@pfmoore
Copy link
Member

pfmoore commented Oct 24, 2018

IMO, PEP 376 is clear: "The RECORD file is a CSV file, composed of records, one line per installed file" (my emphasis). And PEP 427 says "list of (almost) all the files in the wheel" so I think that's OK too (it's reasonable to infer the "one line per file" restriction from PEP 376).

So based on my reading of the PEPs, duplicate filenames in RECORD are invalid.

I'm OK with pip reporting an error if there are duplicates. But I won't object if someone wants to argue for continuing with some sort of warning. Any argument for continuing should probably consider the security implications, though (the user might not end up with what they expect).

If we dynamically generate a file that clashes with a file that's supplied in the wheel, my view is that we should do the same (I tend towards reporting an error, but I'm OK with someone arguing for overwrite-and-continue).

Regarding "Should RECORD contain anything outside of the to-be-installed site-packages" I'm not sure what you're thinking of here. Can you clarify?

One possible sticking point is case insensitive vs case sensitive filesystems. If a wheel contains a.py and A.py, do they count as duplicates? What if you're installing the wheel on a case insensitive filesystem? I'd like to say that the safest view says that we should treat them as the same, but that may not work for people building wheels that are only ever intended to be used on Unix (although are there ever cases where you need 2 files that differ only in case). Either way, that is something that PEP 427 should be updated to make an explicit statement on.

@uranusjr
Copy link
Member Author

Regarding the “outside of the to-be-installed site-packages” part—a package can contain a RECORD rule like pip/__init__.py,,. Since this is copied to the installed dist-info, the file will be removed when you uninstall the package.

But on second thought, this is probably needed since a Python package really can write to anywhere. I’ll remove that part from the list.

@cjerdonek
Copy link
Member

If a wheel contains a.py and A.py, do they count as duplicates?

In terms of making forward progress, it might be simplest to decide first how to handle paths that are equal as strings, since that would be a lot easier. Then how to handle paths that match case-insensitively could be decided on later. The case insensitivity issue seems a lot trickier to me and less clear.

@pfmoore
Copy link
Member

pfmoore commented Oct 25, 2018

Agreed. Sorry if I wasn't clear - case insensitivity is definitely a "consider it later" problem :-)

@uranusjr
Copy link
Member Author

So the next thing to consider is, what should pip do? I agree it makes sense to generate an error. It could make sense to use a warning instead (especially if the duplicate rows have matching hashes), but it’s easier to err on the restrictive side, and relax the rules if necessary.

I am not very familiar with this section of pip, but it doesn’t seem to perform any error handling right now. Would it be enough to simply raise an exception when this happens, and improve from there?

@cjerdonek
Copy link
Member

I don't know much about this area either. Is pip only reading RECORD files, or also writing new ones? In the case of reading, without knowing much, it doesn't seem like pip should error out if the hashes (and sizes) are the same (or compatible, like if one is empty). However, if the hashes and/or sizes are conflicting, then it seems like erroring out would be a good thing (because pip has no way of knowing which row to choose).

@pfmoore
Copy link
Member

pfmoore commented Oct 25, 2018

AIUI, the only case that's actually been seen of this happening is a bad wheel file generated by poetry, which has since been superseded (and the poetry author agreed that the wheel was wrong). So an exception/traceback sounds sufficient at least for now. Let's not spend too much time on a mostly-theoretical issue.

Regarding where pip handles RECORD files, I'm not that familiar with this area either, but I think we do the following:

  1. Read RECORD files from wheels that we're installing (which is where this problem showed up, I believe).
  2. Write RECORD files when installing wheels (we shouldn't have a problem here - we're just recording what we wrote, so why would it include duplicates?)

I'd stick to fixing the case that the bad poetry wheel triggered, and worry about anything else when it happens. If we're saying that duplicates are invalid according to the PEPs, then it's not like there's an urgent need to tidy up any potential failures where that assumption is violated.

@pfmoore
Copy link
Member

pfmoore commented Oct 25, 2018

More specifically, pip never writes RECORD files in wheels - that's the build backend's job. The only place where pip would ever write a record file is when installing, to the PEP 376 "database of installed packages" (dist-info directory in site-packages).

@cjerdonek
Copy link
Member

How does pip use the information in the RECORD file when reading it? Like, does it do anything with the hashes and sizes, etc, and what does it do with the paths that are listed (as opposed to paths not listed)?

@uranusjr
Copy link
Member Author

uranusjr commented Oct 25, 2018

# Record details of all files installed
record = os.path.join(info_dir[0], 'RECORD')
temp_record = os.path.join(info_dir[0], 'RECORD.pip')
with open_for_csv(record, 'r') as record_in:
with open_for_csv(temp_record, 'w+') as record_out:
reader = csv.reader(record_in)
writer = csv.writer(record_out)
outrows = []
for row in reader:
row[0] = installed.pop(row[0], row[0])
if row[0] in changed:
row[1], row[2] = rehash(row[0])
outrows.append(tuple(row))
for f in generated:
digest, length = rehash(f)
outrows.append((normpath(f, lib_dir), digest, length))
for f in installed:
outrows.append((installed[f], '', ''))
# Sort to simplify testing.
for row in sorted_outrows(outrows):
writer.writerow(row)
shutil.move(temp_record, record)

If my understanding of the code is correct, pip does not consult the wheel’s RECORD files, but do it the other way around: it copies all files in the wheel, and add/fix lines into the installed RECORD if they are missing/different from the wheel’s.

@pfmoore
Copy link
Member

pfmoore commented Oct 25, 2018

So how were we getting an error from the bad poetry wheel? I'm confused now...

@cjerdonek cjerdonek added type: enhancement Improvements to functionality C: wheel The wheel format and 'pip wheel' command labels Oct 25, 2018
@uranusjr
Copy link
Member Author

The bad Poetry wheel’s RECORD contains a row duplicating one of generated. Line 525 above checks whether the wheel’s RECORD duplicates one of installed, but not generated.

installed (a set) contains paths to files copied from the wheel, generated contains paths of console and GUI scripts generated dynamically from the wheel’s entry point specification.

@uranusjr
Copy link
Member Author

Also the error it causes is during writing, not reading. pip sorts rows before it writes them (line 535); duplicate rows cause the previous sorting function to fail.

@cjerdonek
Copy link
Member

In terms of this issue, I'm guessing we all agree that pip should always only write valid RECORD files (and to error out if it can't).

However, when reading an existing RECORD file in order to write one, it seems like there are two types of errors that pip might encounter:

  1. cases where pip is able to "fix" the error with no question (e.g. a row that appears twice identically -- the fix is to write it only once, not twice), and
  2. cases where it's not clear how to fix the error.

For (2), I feel like pip should error out. But how do people feel about (1)? Also, was the Poetry error of type (1) or (2)?

@uranusjr
Copy link
Member Author

I think it can be categorised as (1). Since pip favours the generated script over the one included in wheel, it can ignore the row from the wheel’s RECORD.

@pradyunsg
Copy link
Member

I concur with @uranusjr's suggestion.

@cjerdonek cjerdonek changed the title Potential duplicate entries in RECORD Handling RECORD files with duplicate entries Jan 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: wheel The wheel format and 'pip wheel' command state: needs discussion This needs some more discussion type: enhancement Improvements to functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants