Demo: Exploring the Cars Dataset

We’ll start this tutorial with a demo to whet your appetite for learning more. This section purposely moves quickly through many of the concepts (e.g. data, marks, encodings, aggregation, data types, selections, etc.) We will return to treat each of these in more depth later in the tutorial, so don’t worry if it all seems to go a bit quickly!

In the tutorial itself, this will be done from scratch in a blank notebook. However, for the sake of people who want to look back on what we did live, I’ll do my best to reproduce the examples and the discussion here.

1. Imports and Data

We’ll start with importing the Altair package:

import altair as alt

Now we’ll use the vega_datasets package, to load an example dataset:

from vega_datasets import data

cars = data.cars()
cars.head()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA

Notice that this data is in columnar format: that is, each column contains an attribute of a data point, and each row contains a single instance of the data (here, a single make & model of car).

2. Zero, One, and Two-dimensional Charts

Using Altair, we can being to explore this data.

The most basic chart contains the dataset, along with a mark to represent each row:

alt.Chart(cars).mark_point()

This is a pretty silly chart, because it consists of 406 points, all laid-out on top of each other.

To make it more interesting, we need to encode columns of the data into visual features of the plot (e.g. x position, y position, size, color, etc.)

Let’s encode miles per gallon on the x-axis using the encode() method:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon'
)

This is a bit better, but the point mark is probably not the best for a 1D chart like this.

Let’s try the tick mark instead:

alt.Chart(cars).mark_tick().encode(
    x='Miles_per_Gallon'
)

Or we can expand this into a 2D chart by also encoding the y value. We’ll return to using point markers, and put Horsepower on the y-axis

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower'
)

3 Simple Interactions

One of the nicest features of Altair is the grammar of interaction that it provides. The simplest kind of interaction is the ability to pan and zoom along charts; Altair contains a shortcut to enable this via the interactive() method:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower'
).interactive()

This lets you click and drag, as well as use your computer’s scroll/zoom behavior to zoom in and out on the chart.

We’ll see other interactions later.

4. A Third Dimension: Color

A 2D plot allows us to encode two dimensions of the data. Let’s look at using color to encode a third:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin'
)

Notice that when we use a categorical value for color, it chooses an appropriate color map for categorical data.

Let’s see what happens when we use a continuous color value:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Acceleration'
)

A continuous color results in a color scale that is appropriate for continuous data.

What about the in-between case: ordered categories, like number of cylinders?

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Cylinders'
)

Altair still chooses a continuous value because the number of Cylinders is numerical.

We can improve this by specifying that the data should be treated as a discrete ordered value; we can do this by adding ":O" (“O” for “ordinal” or “ordered categories”) after the encoding:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Cylinders:O'
)

Now we get a discrete legend with an ordered color mapping.

5. Binning and aggregation

Let’s return quickly to our 1D chart of miles per gallon:

alt.Chart(cars).mark_tick().encode(
    x='Miles_per_Gallon',
)

Another way we might represent this data is to creat a histogram: to bin the x data and show the count on the y axis. In many plotting libraries this is done with a special method like hist(). In Altair, such binning and aggregation is part of the declarative API.

To move beyond a simple field name, we use alt.X() for the x encoding, and we use 'count()' for the y encoding:

alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=True),
    y='count()'
)

If we want more control over the bins, we can use alt.Bin to adjust bin parameters

alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),
    y='count()'
)

If we apply another encoding (such as color), the data will be automatically grouped within each bin:

alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),
    y='count()',
    color='Origin'
)

If you prefer a separate plot for each category, the column encoding can help:

alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),
    y='count()',
    color='Origin',
    column='Origin'
)

Binning and aggregation works in two dimensions as well; we can use the rect marker and visualize the count using the color:

alt.Chart(cars).mark_rect().encode(
    x=alt.X('Miles_per_Gallon', bin=True),
    y=alt.Y('Horsepower', bin=True),
    color='count()'
)

Aggregations can be more than simple counts; we can also aggregate and compute the mean of a third quantity within each bin

alt.Chart(cars).mark_rect().encode(
    x=alt.X('Miles_per_Gallon', bin=True),
    y=alt.Y('Horsepower', bin=True),
    color='mean(Weight_in_lbs)'
)

6. Time-Series & Layering

So far we’ve been ignoring the date column, but it’s interesting to see the trends with time of, for example, miles per gallon:

alt.Chart(cars).mark_point().encode(
    x='Year',
    y='Miles_per_Gallon'
)

Each year has a number of cars, and a lot of overlap in the data. We can clean this up a bit by plotting the mean at each x value:

alt.Chart(cars).mark_line().encode(
    x='Year',
    y='mean(Miles_per_Gallon)',
)

Alternatively, we can change the mark to area and use the ci0 and ci1 mark to plot the confidence interval of the estimate of the mean:

alt.Chart(cars).mark_area().encode(
    x='Year',
    y='ci0(Miles_per_Gallon)',
    y2='ci1(Miles_per_Gallon)'
)

Let’s adjust this chart a bit: add some opacity, color by the country of origin, and make the width a bit wider, and add a cleaner axis title:

alt.Chart(cars).mark_area(opacity=0.3).encode(
    x=alt.X('Year', timeUnit='year'),
    y=alt.Y('ci0(Miles_per_Gallon)', axis=alt.Axis(title='Miles per Gallon')),
    y2='ci1(Miles_per_Gallon)',
    color='Origin'
).properties(
    width=800
)

Finally, we can use Altair’s layering API to layer a line chart representing the mean on top of the area chart representing the confidence interval:

spread = alt.Chart(cars).mark_area(opacity=0.3).encode(
    x=alt.X('Year', timeUnit='year'),
    y=alt.Y('ci0(Miles_per_Gallon)', axis=alt.Axis(title='Miles per Gallon')),
    y2='ci1(Miles_per_Gallon)',
    color='Origin'
).properties(
    width=800
)

lines = alt.Chart(cars).mark_line().encode(
    x=alt.X('Year', timeUnit='year'),
    y='mean(Miles_per_Gallon)',
    color='Origin'
).properties(
    width=800
)

spread + lines

7. Interactivity: Selections

Let’s return to our scatter plot, and take a look at the other types of interactivity that Altair offers:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin'
)

Recall that you can add interactive() to the end of a chart to enable the most basic interactive scales:

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin'
).interactive()

Altair provides a general selection API for creating interactive plots; for example, here we create an interval selection:

interval = alt.selection_interval()

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color='Origin'
).add_selection(
    interval
)

Currently this selection doesn’t actually do anything, but we can change that by conditioning the color on this selection:

interval = alt.selection_interval()

alt.Chart(cars).mark_point().encode(
    x='Miles_per_Gallon',
    y='Horsepower',
    color=alt.condition(interval, 'Origin', alt.value('lightgray'))
).add_selection(
    interval
)

The nice thing about this selection API is that it automatically applies across any compound charts; for example, here we can horizontally concatenate two charts, and since they both have the same selection they both respond appropriately:

interval = alt.selection_interval()

base = alt.Chart(cars).mark_point().encode(
    y='Horsepower',
    color=alt.condition(interval, 'Origin', alt.value('lightgray')),
    tooltip='Name'
).add_selection(
    interval
)

base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')

We can do even more sophisticated things with selections as well. For example, let’s make a histogram of the number of cars by Origin, and stack it on our scatterplot:

interval = alt.selection_interval()

base = alt.Chart(cars).mark_point().encode(
    y='Horsepower',
    color=alt.condition(interval, 'Origin', alt.value('lightgray')),
    tooltip='Name'
).add_selection(
    interval
)

hist = alt.Chart(cars).mark_bar().encode(
    x='count()',
    y='Origin',
    color='Origin'
).properties(
    width=800,
    height=80
).transform_filter(
    interval
)

scatter = base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')

scatter & hist

This demo has covered a number of the available components of Altair. In the following sections, we’ll look into each of these a bit more systematically.