Binning and Aggregation

We have discussed data, marks, encodings, and encoding types. The next essential piece of Altair's API is its approach to binning and aggregating data.

import altair as alt
from vega_datasets import data
cars = data.cars()

cars.head()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA

Group-By in Pandas

One key operation in data exploration is the group-by, discussed in detail in Chapter 4 of the Python Data Science Handbook. In short, the group-by splits the data according to some condition, applies some aggregation within those groups, and then combines the data back together:

[Figure: the split-apply-combine operation]

For the cars data, you might split by Origin, compute the mean of the miles per gallon, and then combine the results. In Pandas, the operation looks like this:

cars.groupby('Origin')['Miles_per_Gallon'].mean()
Origin
Europe    27.891429
Japan     30.450633
USA       20.083534
Name: Miles_per_Gallon, dtype: float64

In Altair, this sort of split-apply-combine can be performed by passing an aggregation operator within a string to any encoding. For example, we can display a plot representing the above aggregation as follows:

alt.Chart(cars).mark_bar().encode(
    y='Origin',
    x='mean(Miles_per_Gallon)'
)

Notice that the grouping is done implicitly within the encodings: here we group only by Origin, then compute the mean over each group.
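
The shorthand string "mean(Miles_per_Gallon)" is equivalent to spelling out the field, aggregate, and type explicitly. As a sketch, the same chart in long form looks like this:

alt.Chart(cars).mark_bar().encode(
    y=alt.Y(field='Origin', type='nominal'),
    x=alt.X(field='Miles_per_Gallon', aggregate='mean', type='quantitative')
)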

One-dimensional Binnings: Histograms

One of the most common uses of binning is the creation of histograms. For example, here is a histogram of miles per gallon:

alt.Chart(cars).mark_bar().encode(
    alt.X('Miles_per_Gallon', bin=True),
    alt.Y('count()'),
    alt.Color('Origin')
)
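
Under the hood this is another split-apply-combine: bin the mileage, group by bin and Origin, and count the rows in each group. A rough Pandas sketch of the same computation (the bin edges here are chosen by hand and won't exactly match Altair's automatically chosen bins):

import pandas as pd

# Hand-picked 5-MPG-wide bins covering the data range; Altair picks its own "nice" bins
mpg_bins = pd.cut(cars['Miles_per_Gallon'], bins=range(5, 55, 5))
cars.groupby([mpg_bins, 'Origin']).size()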

One interesting thing about Altair's declarative approach is that it allows us to start assigning these values to different encodings, to see other views of the exact same data.

So, for example, if we assign the binned miles per gallon to the color, we get this view of the data:

alt.Chart(cars).mark_bar().encode(
    color=alt.Color('Miles_per_Gallon', bin=True),
    x='count()',
    y='Origin'
)

This gives us a better appreciation of the proportion of each mileage range within each country of origin.

If we wish, we can normalize the counts on the x-axis to compare proportions directly:

alt.Chart(cars).mark_bar().encode(
    color=alt.Color('Miles_per_Gallon', bin=True),
    x=alt.X('count()', stack='normalize'),
    y='Origin'
)

We see that well over half of US cars were in the “low mileage” category.
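
To check these proportions numerically, a Pandas cross-tabulation gives a similar breakdown (again with hand-picked bin edges, so the numbers will only roughly match the chart's default bins):

import pandas as pd

# Fraction of each 5-MPG bin within each Origin; each row sums to 1
pd.crosstab(
    cars['Origin'],
    pd.cut(cars['Miles_per_Gallon'], bins=range(5, 55, 5)),
    normalize='index'
)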

Changing the encoding again, let’s map the color to the count instead:

alt.Chart(cars).mark_rect().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=20)),
    color='count()',
    y='Origin',
)

Now we see the same dataset as a heat map!

This is one of the beautiful things about Altair: its grammar makes the relationships between different chart types apparent; for example, a 2D heat map encodes the same data as a stacked histogram!
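
To make this concrete, here is a sketch that places the two views side by side using Altair's | concatenation operator; both charts use exactly the same binned field and count() aggregate, differing only in which encodings they are assigned to:

stacked = alt.Chart(cars).mark_bar().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=20)),
    y='count()',
    color='Origin'
)

heatmap = alt.Chart(cars).mark_rect().encode(
    x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=20)),
    y='Origin',
    color='count()'
)

stacked | heatmap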

Other aggregates

Aggregates can also be used with data that is only implicitly binned. For example, look at this plot of MPG over time:

alt.Chart(cars).mark_point().encode(
    x='Year:T',
    color='Origin',
    y='Miles_per_Gallon'
)

The fact that the points overlap so much makes it difficult to see important parts of the data; we can make it clearer by plotting the mean in each group (here, the mean of each Year/Country combination):

alt.Chart(cars).mark_line().encode(
    x='Year:T',
    color='Origin',
    y='mean(Miles_per_Gallon)'
)

The mean aggregate only tells part of the story, though: Altair also provides built-in tools to compute the lower and upper bounds of confidence intervals on the mean.

We can use mark_area() here, and specify the lower and upper bounds of the area using y and y2:

alt.Chart(cars).mark_area(opacity=0.3).encode(
    x='Year:T',
    color='Origin',
    y='ci0(Miles_per_Gallon)',
    y2='ci1(Miles_per_Gallon)'
)
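
The band and the mean line can also be layered into a single chart with Altair's + operator; a minimal sketch:

band = alt.Chart(cars).mark_area(opacity=0.3).encode(
    x='Year:T',
    color='Origin',
    y='ci0(Miles_per_Gallon)',
    y2='ci1(Miles_per_Gallon)'
)

line = alt.Chart(cars).mark_line().encode(
    x='Year:T',
    color='Origin',
    y='mean(Miles_per_Gallon)'
)

band + line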

Time Binnings

One special kind of binning is the grouping of temporal values by aspects of the date: for example, month of the year, or day of the month. To explore this, let's look at a simple dataset of hourly temperatures in Seattle:

temps = data.seattle_temps()
temps.head()
date temp
0 2010-01-01 00:00:00 39.4
1 2010-01-01 01:00:00 39.2
2 2010-01-01 02:00:00 39.0
3 2010-01-01 03:00:00 38.9
4 2010-01-01 04:00:00 38.8

If we try to plot this data with Altair, we will get a MaxRowsError:

alt.Chart(temps).mark_line().encode(
    x='date:T',
    y='temp:Q'
)
---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/altair/vegalite/v4/api.py in to_dict(self, *args, **kwargs)
    361         copy = self.copy(deep=False)
    362         original_data = getattr(copy, "data", Undefined)
--> 363         copy.data = _prepare_data(original_data, context)
    364 
    365         if original_data is not Undefined:

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/altair/vegalite/v4/api.py in _prepare_data(data, context)
     82     # convert dataframes  or objects with __geo_interface__ to dict
     83     if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 84         data = _pipe(data, data_transformers.get())
     85 
     86     # convert string input to a URLData

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    632     """
    633     for func in funcs:
--> 634         data = func(data)
    635     return data
    636 

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/altair/vegalite/data.py in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
     20 
     21 

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/toolz/functoolz.py in pipe(data, *funcs)
    632     """
    633     for func in funcs:
--> 634         data = func(data)
    635     return data
    636 

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/toolz/functoolz.py in __call__(self, *args, **kwargs)
    301     def __call__(self, *args, **kwargs):
    302         try:
--> 303             return self._partial(*args, **kwargs)
    304         except TypeError as exc:
    305             if self._should_curry(args, kwargs, exc):

/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/altair/utils/data.py in limit_rows(data, max_rows)
     82             "than the maximum allowed ({}). "
     83             "For information on how to plot larger datasets "
---> 84             "in Altair, see the documentation".format(max_rows)
     85         )
     86     return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)
len(temps)
8759

Aside: How Altair Encodes Data

We chose to raise a MaxRowsError for datasets larger than 5000 rows based on our observation of students using Altair: unless you think about how your data is being represented, it's quite easy to end up with very large notebooks in which performance will suffer.

When you pass a pandas dataframe to an Altair chart, the result is that the data is converted to JSON and stored in the chart specification. This specification is then embedded in the output of your notebook, and if you make a few dozen charts this way with a large enough dataset, it can significantly slow down your machine.
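
You can see this directly by inspecting a chart's dictionary representation; with just a couple of rows the embedded values are easy to spot:

# The dataframe rows appear verbatim under 'data' -> 'values' in the specification
alt.Chart(cars.head(2)).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q'
).to_dict()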

So how do we get around this error? There are a few ways:

  1. Use a smaller dataset. For example, we could use Pandas to aggregate the temperatures by day:

    import pandas as pd
    temps = temps.groupby(pd.DatetimeIndex(temps.date).date).mean().reset_index()
    
  2. Disable the MaxRowsError using

    alt.data_transformers.enable('default', max_rows=None)
    

    But note this can lead to very large notebooks if you’re not careful.

  3. Serve your data from a local threaded server. The altair data server package makes this easy.

    alt.data_transformers.enable('data_server')
    

    Note that this approach may not work on some cloud-based Jupyter notebook services.

  4. Use a URL which points to the data source. Creating a gist is a quick and easy way to store frequently used data.

We'll use the last approach here, which is the most convenient and leads to the best performance. All of the datasets in vega_datasets provide a url property.

temps = data.seattle_temps.url
alt.Chart(temps).mark_line().to_dict()
{'config': {'view': {'continuousWidth': 400, 'continuousHeight': 300}},
 'data': {'url': 'https://vega.github.io/vega-datasets/data/seattle-temps.csv'},
 'mark': 'line',
 '$schema': 'https://vega.github.io/schema/vega-lite/v4.8.1.json'}

Notice that instead of including the entire dataset, only the URL is used.

Now let's try our plot again:

alt.Chart(temps).mark_line().encode(
    x='date:T',
    y='temp:Q'
)

This data is a little bit crowded; suppose we would like to bin it by month. We'll do this using a TimeUnit transform on the date:

alt.Chart(temps).mark_point().encode(
    x=alt.X('month(date):T'),
    y='temp:Q'
)

This might be clearer if we now aggregate the temperatures:

alt.Chart(temps).mark_bar().encode(
    x=alt.X('month(date):O'),
    y='mean(temp):Q'
)

We can also split the date in two different ways within one chart to produce interesting views of the data; for example:

alt.Chart(temps).mark_rect().encode(
    x=alt.X('date(date):O'),
    y=alt.Y('month(date):O'),
    color='mean(temp):Q'
)

Or we can look at the hourly average temperature as a function of month:

alt.Chart(temps).mark_rect().encode(
    x=alt.X('hours(date):O'),
    y=alt.Y('month(date):O'),
    color='mean(temp):Q'
)

This kind of transform can be quite useful when working with temporal data.
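
The month(date) shorthand used above is equivalent to an explicit TimeUnit transform; a rough sketch of the more verbose form (the derived field name month here is an arbitrary choice):

alt.Chart(temps).mark_line().encode(
    x=alt.X('month:T', axis=alt.Axis(format='%b')),
    y='mean(temp):Q'
).transform_timeunit(
    month='month(date)'
)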

More information on the TimeUnit Transform is available here: https://altair-viz.github.io/user_guide/transform/timeunit.html#user-guide-timeunit-transform