Commit
Preserve numeric string literals when reading from json.
This avoids surprises like pandas-dev/pandas#42471
robertwb committed Aug 23, 2024
1 parent b2ed1c5 commit 142e392
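
The kind of surprise the commit description refers to can be sketched without pandas at all. The snippet below is a stdlib-only illustration, not pandas' actual conversion code: `coerce_like_numeric_inference` is a hypothetical helper that mimics a "try float, then int" inference step and shows how numeric-looking strings can lose leading zeros or precision once they are coerced.

```python
# Stdlib-only illustration; this is NOT pandas' actual conversion code.
# `coerce_like_numeric_inference` is a hypothetical helper for this sketch.


def coerce_like_numeric_inference(value: str):
    """Try float first, then fall back to int if the float is integral."""
    try:
        as_float = float(value)
    except ValueError:
        return value  # Not numeric-looking: keep the original string.
    if as_float.is_integer():
        return int(as_float)
    return as_float


samples = ["123", "00123", "12345678901234567890123"]
for raw in samples:
    coerced = coerce_like_numeric_inference(raw)
    # A value "round-trips" if converting back to str gives the input again.
    round_trips = str(coerced) == raw
    print(f"{raw!r} -> {coerced!r} (round-trips: {round_trips})")
```

Identifiers such as ZIP codes or long account numbers are the typical victims: they look numeric but are only meaningful as strings, which is presumably why keeping them as strings is the safer default.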
Showing 2 changed files with 17 additions and 2 deletions.
5 changes: 5 additions & 0 deletions CHANGES.md
@@ -70,6 +70,11 @@

## Breaking Changes

+* In Python and YAML, ReadFromJson now overrides the default `dtype` from None
+  to an explicit False. Most notably, string values like `"123"` are preserved
+  as strings rather than silently coerced (and possibly truncated) to numeric
+  values. To retain the old behavior, pass `dtype=True` (or any other value
+  accepted by `pandas.read_json`).
* X behavior was changed ([#X](https://github.com/apache/beam/issues/X)).

## Deprecations
14 changes: 12 additions & 2 deletions sdks/python/apache_beam/io/textio.py
@@ -23,7 +23,9 @@
from functools import partial
from typing import TYPE_CHECKING
from typing import Any
+from typing import Dict
from typing import Optional
+from typing import Union

from apache_beam import typehints
from apache_beam.coders import coders
@@ -980,7 +982,12 @@ def WriteToCsv(

@append_pandas_args(pandas.read_json, exclude=['path_or_buf'])
def ReadFromJson(
-    path: str, *, orient: str = 'records', lines: bool = True, **kwargs):
+    path: str,
+    *,
+    orient: str = 'records',
+    lines: bool = True,
+    dtype: Union[bool, Dict[str, Any]] = False,
+    **kwargs):
"""A PTransform for reading json values from files into a PCollection.
Args:
@@ -992,11 +999,14 @@ def ReadFromJson(
lines (bool): Whether each line should be considered a separate record,
as opposed to the entire file being a valid JSON object or list.
Defaults to True (unlike Pandas).
+dtype (bool or dict): If True, infer dtypes; if a dict of column to dtype,
+then use those; if False, then don't infer dtypes at all.
+Defaults to False (unlike Pandas).
**kwargs: Extra arguments passed to `pandas.read_json` (see below).
"""
from apache_beam.dataframe.io import ReadViaPandas
return 'ReadFromJson' >> ReadViaPandas(
-    'json', path, orient=orient, lines=lines, **kwargs)
+    'json', path, orient=orient, lines=lines, dtype=dtype, **kwargs)

@append_pandas_args(
pandas.DataFrame.to_json, exclude=['path_or_buf', 'index'])
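
The change to `ReadFromJson` follows a common pattern: promote an option out of `**kwargs` into a named keyword with its own default, then forward it explicitly. A minimal sketch of that pattern follows, with hypothetical stand-ins (`backend_read_json` and `read_from_json` are illustrative names, not Beam or pandas APIs):

```python
from typing import Any, Dict, Union


def backend_read_json(path: str, dtype: Any = None,
                      **options: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the wrapped reader.

    It only reports the arguments it received, so the effective
    defaults are easy to inspect.
    """
    return {"path": path, "dtype": dtype, **options}


def read_from_json(
    path: str,
    *,
    orient: str = "records",
    lines: bool = True,
    dtype: Union[bool, Dict[str, Any]] = False,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical wrapper that pins dtype=False unless the caller overrides it."""
    return backend_read_json(
        path, orient=orient, lines=lines, dtype=dtype, **kwargs)


# The wrapper's default wins over the backend's default of None.
print(read_from_json("data.json")["dtype"])
# Callers can still opt back into inference or pass per-column dtypes.
print(read_from_json("data.json", dtype=True)["dtype"])
print(read_from_json("data.json", dtype={"id": "str"})["dtype"])
```

Because `dtype` is now a named parameter, a caller who previously passed it through `**kwargs` keeps working unchanged; only callers who relied on the backend's implicit default see different behavior.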