Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monthly data and ordinal encoding create an "off by one" error #1027

Closed
palewire opened this issue Jul 17, 2018 · 9 comments · Fixed by #1053
Closed

Monthly data and ordinal encoding create an "off by one" error #1027

palewire opened this issue Jul 17, 2018 · 9 comments · Fixed by #1053

Comments

@palewire
Copy link
Contributor

palewire commented Jul 17, 2018

I uncovered this "bug" during some data exploration for development of a open-source library tracking the U.S. Consumer Price Index.

The source data is monthly, with the values linked with the first date of each month.

import pandas as pd
import altair as alt
df = pd.read_json("https://raw.githubusercontent.com/datadesk/cpi/master/notebooks/last_13.json", dtype={"date_label": pd.np.datetime64})
df.head(13)[['date',  'pct_change_rounded']]

screenshot from 2018-07-16 18-33-09

My aim was to format the dates in the x-axis labels in the same manner as the government's sample chart.

screenshot from 2018-07-16 18-36-58

I tried to do that with timeUnit and the format option to axis.

alt.Chart(df).mark_bar().encode(
    x=alt.X("date:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)

Here's what I got:

visualization 39

Look closely and you can see that the latest value, June 1, is rendered as May.

What's up with that?

@jakevdp
Copy link
Collaborator

jakevdp commented Jul 17, 2018

Looks like something to do with Vega-Lite. I've opened an issue there.

@jakevdp
Copy link
Collaborator

jakevdp commented Jul 18, 2018

Per response in vega/vega-lite#4044, t looks like the issue is that Javascript parses dates in UTC. You can fix this here by using timeUnit='utcyearmonth' rather than timeUnit='yearmonth'

@palewire
Copy link
Contributor Author

I'm not sure I totally get it. If I submit "2018-06-01," how is that parsed by VegaLite? What would the Python datetime object come back as?

@jakevdp
Copy link
Collaborator

jakevdp commented Jul 19, 2018

My understanding is that it is parsed (in UTC) as 00:00 on 2018-06-01, which is then interpreted (in ISO) as midnight on 2018-05-31, leading to the issue you see.

Edit: I'm wrong here – see new comment below

The solution is to parse the month and year in UTC.

I agree that this is not very satisfactory, but it sounds like it's pretty deeply baked into the vega/vega-lite stack.

@palewire
Copy link
Contributor Author

palewire commented Jul 19, 2018 via email

@jakevdp
Copy link
Collaborator

jakevdp commented Jul 26, 2018

OK, so it turns out I was completely wrong on the reason for this issue when I speculated above.

The issue is that the timestamps are being parsed as if they're UTC, and then displayed in local time (i.e. compensating for timezone). Because the west coast is 8 hours behind london, this shifts the time to the previous month for the dates you're using. If you were east of London when running this code, you wouldn't see this issue 😄.

So the best solution is to make sure dates are both parsed and displayed as local time, so no time zone correction is required.

Looking at the Vega-Lite docs on UTC time, it looks like (and this seems entirely crazy to me) the way you make sure dates are parsed as local time is to not use ISO format. Altair serializes datetime data in ISO format by default, so timezone corrections will always be applied.

But if you change the serialization to be non-ISO compliant, you can make things work in as expected:

df['date2'] = df['date'].dt.strftime('%b %d %Y')  # non-ISO serialization 

alt.Chart(df).mark_bar().encode(
    x=alt.X("date2:O", timeUnit="yearmonth", axis=alt.Axis(format="%b %y")),
    y="pct_change_rounded:Q"
)

visualization 18

This format-dependent parsing of dates in Vega is more than a bit surprising to me, and I hope that it can be addressed upstream. But if not, I'd propose we change the way we serialize dates in Altair so that they will be parsed the same way they are displayed, without any implicit time-zone conversion.

@jakevdp
Copy link
Collaborator

jakevdp commented Jul 26, 2018

We'd just need to modify this line to use a non-ISO date format that is correctly parsed by the vega stack:
https://github.com/altair-viz/altair/blob/da9d6141efe74211f768adf2efaafceeb5962495/altair/utils/core.py#L127

@jakevdp
Copy link
Collaborator

jakevdp commented Jul 27, 2018

Update: thanks to some input from @domoritz I understand this a bit better. It looks like if we want to ensure dates are parsed as local times, we need to either use unix timestamps or fully-qualified ISO-8601 dates. This is not a characteristic of Vega or Vega-Lite, but of Javascript itself. For example:

Given a date string of "March 7, 2014", parse() assumes a local time zone, but given an ISO format such as "2014-03-07" it will assume a time zone of UTC (ES5 and ECMAScript 2015). Therefore Date objects produced using those strings may represent different moments in time depending on the version of ECMAScript supported unless the system is set with a local time zone of UTC. This means that two date strings that appear equivalent may result in two different values depending on the format of the string that is being converted.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/parse

So if we replace df[col_name].astype(str) with df[col_name].dt.strftime('%Y-%m-%dT%H:%M:%SZ') in sanitize_dataframe, then these kinds of off-by-one errors will be avoided.

@palewire
Copy link
Contributor Author

I'm still not sure I totally get this. But that is almost always the case when it comes to time zones! ;)

If you make a patch and want me to test my code again, I'm happy to help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants