Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: improve sensitive data handling by fsspec connectors #2194

Merged
merged 41 commits into from
Dec 5, 2023

Conversation

rbiseck3
Copy link
Contributor

@rbiseck3 rbiseck3 commented Nov 30, 2023

Description

Building off of PR #2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes:

  • Update the base classes to inherit from the enhanced json mixin
  • Add in a new access config dataclass that should be used as a nest dataclass in the connector configs
  • Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config
  • Update all fsspec connectors with explicit access configs given what each one's SDKs support
  • Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass.
  • Update all the writes to use class-based approach and share the same structure of the runner classes
  • Above update allowed for better code to be used in the base source and destination CLI commands
  • Add in utility code around paring a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses.

Slightly unrelated changes:

  • session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked.
  • Dedicated write configs were created for all destination connectors to better support serialization
  • Refactor of Elasticsearch connector included, with update to ingest test to use auth

TODOs

  • Left a #TODO in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. googleapiclient.discovery.Resource) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow.

Copy link
Contributor

@potter-potter potter-potter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through everything. A lot to go through. But works well and I didn't see anything suspicious. Looks good. Really liking the access_configs instead of access_kwargs.

@rbiseck3 rbiseck3 added this pull request to the merge queue Dec 5, 2023
Merged via the queue into main with commit f193d3d Dec 5, 2023
50 checks passed
@rbiseck3 rbiseck3 deleted the roman/improve-sensitive-data-handling-ingest branch December 5, 2023 21:30
github-merge-queue bot pushed a commit that referenced this pull request Dec 11, 2023
### Description
All other connectors that were not included in
#2194 are now
updated to follow the new pattern and mark any variables as sensitive
where it makes sense.
Core changes:
* All connectors now support an `AccessConfig` to mark data that's
needed for auth (i.e. username, password) and those that are sensitive
are designated appropriately using the new enhanced field.
* All cli configs on the cli definition now inherit from the base config
in the connector file to reuse the variables set on that dataclass
* The base writer class was updated to better generalize the new
approach given better use of dataclasses
* The base cli classes were refactored to also take into account the
need for a connector and write config when creating the respective
runner/writer classes.
* Any mismatch between the cli field name and the dataclass field name
were updated on the dataclass side to not impact the user but maintain
consistency
* Add custom redaction logic for mongodb URIs since the password is
expected to be a part of it. Now this:
`"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"`
->
`"mongodb+srv://ingest-test-user:***REDACTED***@ingest-test.hgaig.mongodb.net/"`
in the logs
* Bundle all fsspec based files into their own packages. 
* Refactor custom `_decode_dataclass` used for enhanced json mixin by
using a monkey-patch approach. The original approach was breaking on
optional nested dataclasses when serializing since the other methods in
`dataclasses_json_core` weren't using the new method. By monkey-patching
the original method with a new one, all other methods in that library
would use the new one.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants