Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Include only referenced data columns in chart specs #2586

Closed
wants to merge 2 commits into from

Conversation

joelostblom
Copy link
Contributor

This PR takes an inelegant and conservative approach to fix #2428. It converts the chart spec (without the data) to a string and removes each of the data field names that is not found in that string (i.e. the data fields not referenced anywhere in the spec). It will not work for short columns names such as 'x', or those that have the same name as a key in the VegaLite spec, e.g. 'field', but it is still likely useful for the vast majority of cases and can reduce the chart size significantly specification with many unused columns. For example, this chart

import altair as alt
from vega_datasets import data

alt.data_transformers.disable_max_rows()
source = data.movies()

import sys
sys.getsizeof(alt.Chart(source).mark_rect().encode(
    alt.X('IMDB_Rating:Q', bin=alt.Bin(maxbins=60)),
    alt.Y('Rotten_Tomatoes_Rating:Q', bin=alt.Bin(maxbins=40)),
    alt.Color('count():Q', scale=alt.Scale(scheme='greenblue'))
).to_json())

has a size of 1873345 bytes because there is a lot of unused columns included in the data that are not in the spec:

{'data-d01e86a42596d3c55cbbce3d21a64616': [
   {'Title': 'The Land Girls',
   'US_Gross': 146083.0,
   'Worldwide_Gross': 146083.0,
   'US_DVD_Sales': None,
   'Production_Budget': 8000000.0,
   'Release_Date': 'Jun 12 1998',
   'MPAA_Rating': 'R',
   'Running_Time_min': None,
   'Distributor': 'Gramercy',
   'Source': None,
   'Major_Genre': None,
   'Creative_Type': None,
   'Director': None,
   'Rotten_Tomatoes_Rating': None,
   'IMDB_Rating': 6.1,
   'IMDB_Votes': 1071.0},
   ...

Creating the same chart with the changes in this PR reduces the chart size ~7x to 269838 bytes and successfully remove all the irrelevant columns since they have unique names not found anywhere in the spec:

{'data-d01e86a42596d3c55cbbce3d21a64616': [
  {'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
  ...

This PR seems to pass all the tests, but I still want to run some benchmarks to see if the loops introduce any significant performance regressions for larger data sets.

I also need to move the change to the schema generation file instead of the automatically generated vegalite api file and this PR is not important as some of the other open PRs, but I started playing around with it and thought I would put this up in case anyone else wants to try or have better ideas of how to achieve the same results.

@jakevdp
Copy link
Collaborator

jakevdp commented Apr 2, 2022

I've been worried about adding this sort of thing because the failure mode in which we remove a column that's needed by the chart is something that can't easily be recovered from. If we land something like this, I think it needs an "off switch" so that Altair does not become completely unusable in the event of a corner case we haven't thought of.

@joelostblom
Copy link
Contributor Author

joelostblom commented Apr 2, 2022

Do you mean adding something like alt.data_transformers.disable_data_compression()? But still having the "compression" in this PR (or similar) as the default behavior?

I was trying to think of a scenario where a column could be used in Vega-Lite without the column name being explicity referenced at least somewhere in the spec, but could not imagine what that would look like. Maybe @domoritz knows if this could happen (let me know if you prefer not to be pinged like this @domoritz)?


Another think I think would be helpful to reduce the chart size would be if Vega-Lite supported data in columnar instead of records format, since this would lead to much fewer instances of the field names being repeated in the spec. So instead of the current list of dicts:

[
{'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
{'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
{'Rotten_Tomatoes_Rating': None, 'IMDB_Rating': 6.1},
]

it would just be a single dict:

{
'Rotten_Tomatoes_Rating': [None, None, None],
'IMDB_Rating': [6.1, 6.1, 6.1]
},

I haven't looked closely at whether this is possible, maybe it is already and we just need to change the default in Altair.

@jakevdp
Copy link
Collaborator

jakevdp commented Apr 2, 2022

One situation I can think of is column names with special characters that may or may not be escaped when referenced in the chart spec. You could also construct column names dynamically using a Vega expression, which is probably not common but would break with this approach.

@jakevdp
Copy link
Collaborator

jakevdp commented Apr 2, 2022

Here's an example that would break with this PR, I believe:

import altair as alt
import pandas as pd

df = pd.DataFrame({'somecolumn': [1, 2, 3, 4]})

alt.Chart(df).transform_calculate(
    data='datum["some" + "column"]'
).mark_point().encode(
    x='data:Q'
)

@domoritz
Copy link
Member

domoritz commented Apr 3, 2022

Another case could be expressions that reference the keys of a data tuple. Then we should not project either.

In general, I think projecting data to only what's really needed is great but we need to be careful to provide an escape hatch and that we consider common hard cases like the ones we already identified.

@joelostblom
Copy link
Contributor Author

joelostblom commented Apr 3, 2022 via email

@jakevdp
Copy link
Collaborator

jakevdp commented Apr 3, 2022

Even with an escape hatch, I’m not sure we’d want to land this change. Imagine the headache of trying to write some non-trivial Vega expression and finding after an hour that Altair is silently removing the column you’re trying to reference. Unless we can make this account for all cases, we probably shouldn’t land it. And accounting for all cases would essentially require re-implementing Vega in Python.

@joelostblom
Copy link
Contributor Author

joelostblom commented Apr 3, 2022 via email

@jakevdp
Copy link
Collaborator

jakevdp commented Apr 3, 2022

Sure, opt-in would be better, but then 99% of users would never know the feature exists and never benefit from it. I think better would be to avoid the easy broken approach and try for the full correct approach... it would basically require rolling the altair_transform package into the main altair library, expanding its completeness, and utilizing that.

A simpler option might be to only enable this only in the case that the chart contains no transformations, which I think are the only cases that create problems.

@joelostblom
Copy link
Contributor Author

I agree about the problems with discoverability. I like your suggestion of a conservative default option to only enable this compression/projection for specs that do not include transforms. This will benefit many users and we could add an optional flag to include charts with transforms and raise a warning message if this is enabled:

alt.data_transformers.compress_datasets(transformed_charts=True)
UserWarning: Data compression can break charts with transforms where data fields are referenced
without the field name occurring literally in the chart spec. Please revert this option if you are
observing unexpected behavior with transformed charts.

I updated the draft PR with a suggestion of what this could look like. I also agree that the altair_transform options sounds like the best long term solution, and this PR could be valuable in the meantime without breaking changes when/if there is a switch to the altair_transform option in the backend.

@joelostblom joelostblom force-pushed the referenced-cols branch 2 times, most recently from 14c33d5 to 8d1e77a Compare April 4, 2022 02:29
@binste
Copy link
Contributor

binste commented Dec 30, 2022

For reference, removing unused columns is soon possible in a new version of VegaFusion, see discussions in #2738.

@joelostblom
Copy link
Contributor Author

Ah that is great, thanks for mentioning that! We can probably close this PR once we have played around with VegaFusion a bit more then. There would probably still be value in having a simple approach in altair by default, but unless we can solve the issues mentioned above, it might be unnecessarily complicated to have something in altair that works only if there are no transforms instead of just referring to the VegaFusion option for all chart size reduction needs.

@mattijn
Copy link
Contributor

mattijn commented Feb 26, 2023

Let's leave this to VegaFusion if this a concern for anyone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include only referenced columns in the data of each chart specification
5 participants