Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

augur curate I/O: verify NDJSON records have the same fields #1510

Closed
joverlee521 opened this issue Jul 1, 2024 · 2 comments · Fixed by #1518
Closed

augur curate I/O: verify NDJSON records have the same fields #1510

joverlee521 opened this issue Jul 1, 2024 · 2 comments · Fixed by #1518
Assignees
Labels
enhancement New feature or request

Comments

@joverlee521
Copy link
Contributor

Context

Originally discussed in #1506 (comment)

augur curate records can be output to a metadata TSV file, which uses the first record's fields as output columns.

augur/augur/io/metadata.py

Lines 467 to 474 in f6ee377

# Use the keys of the first record as output fields
try:
first_record = next(records)
except StopIteration:
raise AugurError(f"Unable to write records to {output_file} because provided records were empty.")
# Use the record keys as output columns since as of python 3.7 dicts retain insertion order
output_columns = list(first_record.keys())

With that in mind, the centralized inputs parser should verify that the input records all of the same fields.
Then subcommands can operate under the assumption that all records should have the same fields and make changes accordingly.

@joverlee521 joverlee521 added the enhancement New feature or request label Jul 1, 2024
@jameshadfield
Copy link
Member

jameshadfield commented Jul 1, 2024

the centralized inputs parser should verify that the input records all of the same fields

Probably sensible to do this for outputs too, to assert the curate subcommand's run() always returns a consistent set of fields. I haven't checked if csv.DictWriter orders each row based on the information previously passed to writeheader() or if it relies on the insertion order being the same for each row, but if it's the latter then we should assert that as well.

@joverlee521 joverlee521 changed the title augur curate inputs: verify NDJSON records have the same fields augur curate I/O: verify NDJSON records have the same fields Jul 1, 2024
@jameshadfield
Copy link
Member

jameshadfield commented Jul 1, 2024

Some more details about TSV writing via our generalised write_records_to_tsv:

  • We don't need to worry about the order of TSV rows -- DictWriter will re-order each of them to match the header information provided in the set-up.
  • Fields missing from a row will be ignored (empty string used in their place). This is also noted in our docstring but it seems from the discussion in this issue that we may want to explicitly assert that each row has the same keys within augur curate.
  • Extra fields in a row are currently ignored - also noted in our docstring - but we may want to revisit this and assert it upstream within augur curate.

jameshadfield added a commit that referenced this issue Jul 1, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 1, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 1, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 1, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 1, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 1, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 2, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 2, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
genehack added a commit that referenced this issue Jul 2, 2024
Per discussion on #1508 and #1511, the field standardization across
records (cf #1510) makes the need to verify a `database` field less
important — essentially, if there's a `geo_loc_name` field (or a field
with the name given in the `--location-field` argument), parse it.
Otherwise, warn that it's not found.
@joverlee521 joverlee521 self-assigned this Jul 2, 2024
jameshadfield added a commit that referenced this issue Jul 4, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
jameshadfield added a commit that referenced this issue Jul 4, 2024
to match expected behaviour in tests.

The main changes functional changes are around the order of fields,
where we now rename "in-place" rather than adding the renamed column
at the end (which for TSV output is the last column).

More sanity checks are performed on arguments and they are
cross-referenced with the provided records.

Note that this relies on each record having the same fields, and this is
not asserted here. See <#1510>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants