Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop haul-transect/transect-region requirements #274

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

brandynlucca
Copy link
Collaborator

This PR applies a bug-fix to the NASC ingestion procedure by deprecating the creation of the haul-transect and transection-region key external files. This removes the need to intermittently generate these files since the haul-transect map is not required until the biological data are ingested. In theory, there may be a need to retain reading in the biodata_gear.xlsx file for conserving precise transect information for the biological data (i.e. for bootstrapping/randomization). But this is not necessary for the currently available workflow, and possibly avoids issues in the future for datasets where the biodata_gear.xlsx file(s) have transect_num missing entirely.

@brandynlucca brandynlucca self-assigned this Sep 14, 2024
@brandynlucca brandynlucca added bug_in_python Something in the Python implementation does not match what's in Matlab design refactor labels Sep 14, 2024
Copy link
Member

@leewujung leewujung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @brandynlucca : thanks for the PR, and sorry for the delay in review it. I put all my comments below instead of inline comments, since these touch upon broader context than the changes you made. Some are cosmetic (like reorganization) and we can discuss if it is necessary. I think the docs one will be good to have for v0.4.1 though.

read_validated_data

This section below is identical in both the if and else conditions under load_dataset - can they be merged (moved outside of the if-else)?

# Validate datatypes within dataset and make appropriate changes to dtypes
# ---- This first enforces the correct dtype for each imported column
# ---- This then assigns the imported data to the correct class attribute
read_validated_data(
    input_dict,
    configuration_dict,
    file_name,
    sheet_name,
    config_map,
    validation_settings,
)

map_imported_datasets

map_imported_datasets is used first under load_data and then under prepare_input_data, but prepare_input_data is called in load_data. There seems to be some redundancy here?

prepare_input_data

For prepare_input_data I have a few questions and thoughts:

  1. Are the if conditions designed for the cases when only a subset of these exist? What are those conditions -- I was under the impression that all of these need to exist for the workflow to execute.
  • Related to the above -- what happens when none of the if conditions is true?
  • In prepare_input_data there are clear sections handling different data types. The function is so long that I wonder if it makes sense to factor some of them out as standalone functions.
  • Related to the above, since prepare_input_data is only used once under load_data, depending on what the answers of 1-2 are, it might make sense to make individual functions for each data type and just call all of them in load_data?

docs

While reviewing the code, I was trying to understand what is in the config table and found terms such as "superlayer" and "name" pretty confusing. I couldn't find a documentation page that explains what's what in the config file -- is there one? If not, I think we should create such a page with a block diagram showing the flow of operations in load_data. There are multiple mapping and cleaning/subselection in the code, and this will help users/readers understand what is done and why, and potentially help with future refactoring/simplification.

about loading NASC

I think this is the first time I skimmed through the NASC loading functions, so 2 questions:

  • Are there currently tests for the result of load_nasc to compare values from those loaded into Matlab? If not, can that be done? Not sure if there is any bug in the Matlab code to prevent that.
  • What is the difference between read_echoview_export and batch_read_echoview_export?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug_in_python Something in the Python implementation does not match what's in Matlab design refactor
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

Transect number in gear data Deprecate haul_to_transect_mapping_YYYY_COUNTRY.xlsx file requirement
2 participants