Data Transformations

Altair provides a data transformation API that allows both filtering and transformation of values within the plot renderer. Within Vega-Lite, filter and transform operations are specified as JavaScript strings that use the syntax described in Vega’s Expression documentation. Altair provides a Python-style interface to generate these expressions without having to create the strings manually; this can be done either via a direct functional expression interface, or via a Pandas-like dataframe interface. We will see examples of both of these below.

For example, consider this visualization of the historical US population, split by age and gender:

import altair as alt

data = 'https://vega.github.io/vega-datasets/data/population.json'
pink_blue = alt.Scale(range=["lightblue", "pink"])

alt.Chart(data).mark_bar().encode(
    x='age:O',
    y='mean(people):Q',
    color=alt.Color('sex:N', scale=pink_blue)
)

This visualization shows that, averaged over all the years in the dataset, the younger population has far outnumbered the older population. There are two things we might wish to change, however:

  1. We might wish to zero in on a particular year, rather than taking a mean over all years.
  2. The “1” and “2” labels for gender are not all that informative; we should probably change them to “Male” and “Female” for clarity.

We could certainly accomplish this by downloading the dataset, manipulating it in, say, pandas, and building a chart using the result, but it would be nice to do this within the Altair spec itself so that we can use the original data source.

Vega-Lite allows for this via a transform field within the plot specification, and Altair provides a Pandas-style interface by which these transform fields can be specified.

To demonstrate this, let’s remake the plot using this interface to filter the data by year, and to create a new column which maps the 1/2 labels to “Male”/“Female”:

import altair as alt
from altair import expr

pink_blue = alt.Scale(range=["pink", "lightblue"])

# this does not actually download data;
# just puts a dataframe-like interface around the URL reference
data = expr.DataFrame('https://vega.github.io/vega-datasets/data/population.json')

# Add a new column to the data
data['gender'] = expr.where(data.sex == 1, "Male", "Female")

# Create a filtered version of the data
data2000 = data[data.year == 2000]

alt.Chart(data2000).mark_bar().encode(
    x='age:O',
    y='mean(people):Q',
    color=alt.Color('gender:N', scale=pink_blue)
)

Creating and manipulating the data this way generates appropriate code that is stored in the spec and then evaluated at the time the plot is generated. We can see this by printing the resulting specification:

>>> import altair as alt
>>> from altair import expr
>>> data = expr.DataFrame('data.json')
>>> data['gender'] = expr.where(data.sex == 1, "Male", "Female")
>>> data2000 = data[data.year == 2000]
>>> print(alt.Chart(data2000).to_json(indent=2))
{
  "data": {
    "url": "data.json"
  },
  "transform": {
    "calculate": [
      {
        "expr": "if((datum.sex==1),'Male','Female')",
        "field": "gender"
      }
    ],
    "filter": "(datum.year==2000)"
  }
}

Notice that in the resulting specification the data field contains only the URL, and the additional information has been encoded within a transform field using the Expression Interface provided by the Vega package.
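To make the idea of a Python-style expression interface concrete, here is a minimal sketch of how operator overloading can turn Python comparisons into expression strings like the ones in the specification above. The Expr and Datum classes and the to_js and where helpers are hypothetical names invented for this illustration; this is not Altair's actual implementation, and it handles only equality tests and simple literals.

```python
class Expr:
    """Hypothetical expression node; not Altair's actual class."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Overloading == builds a new expression instead of returning a bool.
        return Expr("({0}=={1})".format(self.text, to_js(other)))

    def __str__(self):
        return self.text


class Datum:
    """Attribute access such as datum.year yields the expression 'datum.year'."""

    def __getattr__(self, name):
        return Expr("datum." + name)


def to_js(value):
    """Render a Python value as expression text (no quote escaping in this sketch)."""
    if isinstance(value, Expr):
        return value.text
    if isinstance(value, str):
        return "'" + value + "'"
    return str(value)


def where(condition, if_true, if_false):
    """Build an if(condition, a, b) expression, mirroring the spec output above."""
    return Expr("if({0},{1},{2})".format(
        to_js(condition), to_js(if_true), to_js(if_false)))


datum = Datum()
print(datum.year == 2000)                       # (datum.year==2000)
print(where(datum.sex == 1, "Male", "Female"))  # if((datum.sex==1),'Male','Female')
```

Because == returns a new expression object rather than a boolean, nothing is evaluated in Python; the generated string is simply stored until the renderer evaluates it.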

If you would prefer to add these fields manually rather than using the expr.DataFrame interface, the transform_data() method and the related Transform class give you functional access to these attributes using the vega.expr syntax:

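import altair as alt
from altair import expr

# Same color scale and data source as in the previous example
pink_blue = alt.Scale(range=["pink", "lightblue"])
data = 'https://vega.github.io/vega-datasets/data/population.json'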

alt.Chart(data).mark_bar().encode(
    x='age:O',
    y='mean(people):Q',
    color=alt.Color('gender:N', scale=pink_blue)
).transform_data(
    calculate=[alt.Formula('gender', expr.where(expr.df.sex==1,'Male','Female'))],
    filter=(expr.df.year == 2000)
)

Or, if you really like to do things by hand, the raw JavaScript strings can be passed instead:

alt.Chart(data).mark_bar().encode(
    x='age:O',
    y='mean(people):Q',
    color=alt.Color('gender:N', scale=pink_blue)
).transform_data(
    calculate=[alt.Formula('gender', 'if(datum.sex == 1, "Male", "Female")')],
    filter=('datum.year == 2000')
)

In all of these cases the data manipulation could instead be done as a preprocessing step, embedding the processed data directly in the chart rather than referencing it by URL. Still, this sort of simple manipulation of an existing data source can lead to much more compact and efficient plot specifications.
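The trade-off described above can be illustrated with plain Python. The sketch below applies the same filter and calculate steps eagerly to a set of synthetic records, then compares the size of a specification that embeds the processed rows with one that references the source and carries only a transform. The rows are made up for illustration (the real population.json values differ), preprocess is a hypothetical helper, and the transform layout simply copies the printed specification shown earlier.

```python
import json

# Synthetic records for illustration only; not the real dataset.
rows = [
    {"year": year, "age": age, "sex": sex, "people": 1000 + age + sex}
    for year in (1990, 2000)
    for age in range(0, 95, 5)
    for sex in (1, 2)
]


def preprocess(records, year):
    """Apply the filter and calculate steps eagerly, in Python."""
    return [
        dict(r, gender="Male" if r["sex"] == 1 else "Female")
        for r in records
        if r["year"] == year
    ]


# Option 1: embed the processed rows directly in the specification.
inline_spec = {"data": {"values": preprocess(rows, 2000)}}

# Option 2: reference the source and describe the processing as a transform.
transform_spec = {
    "data": {"url": "data.json"},
    "transform": {
        "calculate": [
            {"field": "gender", "expr": "if((datum.sex==1),'Male','Female')"}
        ],
        "filter": "(datum.year==2000)",
    },
}

# Compare the serialized sizes of the two specifications.
print(len(json.dumps(inline_spec)), len(json.dumps(transform_spec)))
```

The embedded version grows with the number of rows, while the transform version stays the same size however large the source file is.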

The Grouped Bar Chart example shows a more refined view of this same dataset using some of these techniques.