-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Drop haul-transect/transect-region requirements #274
base: main
Are you sure you want to change the base?
Drop haul-transect/transect-region requirements #274
Conversation
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @brandynlucca : thanks for the PR, and sorry for the delay in review it. I put all my comments below instead of inline comments, since these touch upon broader context than the changes you made. Some are cosmetic (like reorganization) and we can discuss if it is necessary. I think the docs one will be good to have for v0.4.1 though.
read_validated_data
This section below is identical in both the if and else conditions under load_dataset
- can they be merged (moved outside of the if-else)?
# Validate datatypes within dataset and make appropriate changes to dtypes
# ---- This first enforces the correct dtype for each imported column
# ---- This then assigns the imported data to the correct class attribute
read_validated_data(
input_dict,
configuration_dict,
file_name,
sheet_name,
config_map,
validation_settings,
)
map_imported_datasets
map_imported_datasets
is used first under load_data
and then under prepare_input_data
, but prepare_input_data
is called in load_data
. There seems to be some redundancy here?
prepare_input_data
For prepare_input_data
I have a few questions and thoughts:
- Are the
if
conditions designed for the cases when only a subset of these exist? What are those conditions -- I was under the impression that all of these need to exist for the workflow to execute.
- Related to the above -- what happens when none of the
if
conditions is true? - In
prepare_input_data
there are clear sections handling different data types. The function is so long that I wonder if it makes sense to factor some of them out as standalone functions. - Related to the above, since
prepare_input_data
is only used once underload_data
, depending on what the answers of 1-2 are, it might make sense to make individual functions for each data type and just call all of them inload_data
?
docs
While reviewing the code, I was trying to understand what is in the config table and found terms such as "superlayer" and "name" pretty confusing. I couldn't find a documentation page that explains what's what in the config file -- is there one? If not, I think we should create such a page with a block diagram showing the flow of operations in load_data
. There are multiple mapping and cleaning/subselection in the code, and this will help users/readers understand what is done and why, and potentially help with future refactoring/simplification.
about loading NASC
I think this is the first time I skimmed through the NASC loading functions, so 2 questions:
- Are there currently tests for the result of
load_nasc
to compare values from those loaded into Matlab? If not, can that be done? Not sure if there is any bug in the Matlab code to prevent that. - What is the difference between
read_echoview_export
andbatch_read_echoview_export
?
This PR applies a bug-fix to the NASC ingestion procedure by deprecating the creation of the haul-transect and transection-region key external files. This removes the need to intermittently generate these files since the haul-transect map is not required until the biological data are ingested. In theory, there may be a need to retain reading in the
biodata_gear.xlsx
file for conserving precise transect information for the biological data (i.e. for bootstrapping/randomization). But this is not necessary for the currently available workflow, and possibly avoids issues in the future for datasets where thebiodata_gear.xlsx
file(s) havetransect_num
missing entirely.