[FEA] Add --meta parameter to explicitly specify the jsonl field dtypes #63

Closed
miguelusque opened this issue May 12, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@miguelusque
Contributor

miguelusque commented May 12, 2024

Is your feature request related to a problem? Please describe.
When reading jsonl files with Dask, the dataframe datatypes are inferred unless explicitly specified.

Inferring the data types can lead to several issues, including incorrect type inference, degraded performance, and increased memory usage.

I think we could mitigate those issues by adding a --meta parameter that receives a dictionary mapping field names to data types.

The parameter would be optional and would work like the meta parameter of dask.dataframe.read_json, documented here: https://docs.dask.org/en/latest/generated/dask.dataframe.read_json.html.
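
To make the proposal concrete, here is a minimal sketch of how a script could accept such a dictionary on the command line. It assumes the value is passed as a JSON object string. The names parse_meta and build_parser are hypothetical, chosen for illustration only, and are not taken from the project.

```python
import argparse
import json


def parse_meta(value: str) -> dict:
    """Parse a JSON object such as '{"id": "str"}' into a field -> dtype dict."""
    try:
        meta = json.loads(value)
    except json.JSONDecodeError as exc:
        raise argparse.ArgumentTypeError(f"--meta is not valid JSON: {exc}") from exc
    # Accept only a flat mapping of field names to dtype names.
    if not isinstance(meta, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in meta.items()
    ):
        raise argparse.ArgumentTypeError("--meta must map field names to dtype names")
    return meta


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical script; only the --meta option is shown.
    parser = argparse.ArgumentParser(description="Example jsonl reader script")
    # Optional: when omitted, the value is None and dtypes are inferred as before.
    parser.add_argument(
        "--meta",
        type=parse_meta,
        default=None,
        help='JSON dict of field name -> dtype, e.g. \'{"id": "str"}\'',
    )
    return parser


args = build_parser().parse_args(
    ["--meta", '{"id": "str", "text": "str", "score": "float32"}']
)
print(args.meta)
```

The parsed dictionary could then be forwarded as the meta argument when calling dask.dataframe.read_json. How the value is actually parsed and passed along in the project is not shown in this thread, so treat the above as one possible approach.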

miguelusque added the enhancement label May 12, 2024
@miguelusque
Contributor Author

I will work on this feature.

@miguelusque
Contributor Author

PR #75.

ayushdg pushed a commit that referenced this issue May 30, 2024
…ield dtypes (#75)

* Add dtype support (optional) when reading jsonl files

Signed-off-by: Miguel Martínez <miguelm@nvidia.com>
Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Change input_meta type hint

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Change input_meta type hint

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Resolve merge conflit

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Assign input_meta to the right variable

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Add warning when input_meta is used with non jsonl files.

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Explicitly check for None when validating input_meta

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Add input_meta test

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Add description to function

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

* Add test_meta_str

Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>

---------

Signed-off-by: Miguel Martínez <miguelm@nvidia.com>
Signed-off-by: Miguel Martínez <miguelusque@users.noreply.github.com>
Co-authored-by: Miguel Martínez <miguelusque@users.noreply.github.com>