Aggregate

There are two ways to aggregate data within Altair: within the encoding itself, or using a top-level aggregate transform.

The aggregate property of a field definition can be used to compute aggregate summary statistics (e.g., median, min, max) over groups of data.

If any field in the specified encoding channels contains an aggregate, the resulting visualization will show aggregate data. In this case, all fields without a specified aggregation function are treated as group-by fields in the aggregation process.

For example, the following bar chart shows the mean acceleration, grouped by the number of cylinders.

import altair as alt
from altair.datasets import data

cars = data.cars.url

alt.Chart(cars).mark_bar().encode(
    y='Cylinders:O',
    x='mean(Acceleration):Q',
)

The Altair shorthand string:

# ...
x='mean(Acceleration):Q',
# ...

is made available for convenience, and is equivalent to the longer form:

# ...
x=alt.X(field='Acceleration', aggregate='mean', type='quantitative'),
# ...

For more information on shorthand encoding specifications, see Encoding Shorthands.

The same plot can be shown via an explicitly computed aggregation, using the transform_aggregate() method:

alt.Chart(cars).mark_bar().encode(
    y='Cylinders:O',
    x='mean_acc:Q'
).transform_aggregate(
    mean_acc='mean(Acceleration)',
    groupby=["Cylinders"]
)

An alternative to using aggregate functions is to preprocess the data with pandas, and then plot the resulting DataFrame:

cars_df = data.cars()
source = (
   cars_df.groupby('Cylinders')
   .Acceleration
   .mean()
   .reset_index()
   .rename(columns={'Acceleration': 'mean_acc'})
)

alt.Chart(source).mark_bar().encode(
   y='Cylinders:O',
   x='mean_acc:Q'
)

Note

Altair transforms are great for quick exploration, while upfront analysis using dedicated dataframe libraries can be faster for large datasets. See Data Transformers for details.

Because Cylinders is of type int64 in the source DataFrame, Altair would have treated it as a quantitative type (instead of ordinal), had we not specified it. Making the type of data explicit is important since it affects the resulting plot; see Effect of Data Type on Color Scales and Effect of Data Type on Axis Scales for two illustrated examples. As a rule of thumb, it is better to make the data type explicit, instead of relying on an implicit type conversion.

Functions Without Arguments

Aggregate functions can be used without arguments. In such cases, the function operates directly on the input objects and returns the same value regardless of the provided field.

The following chart demonstrates this by counting the number of cars with respect to their country of origin.

alt.Chart(cars).mark_bar().encode(
   y='Origin:N',
   # shorthand form of alt.X(aggregate='count')
   x='count()'
)

Note

The count aggregate function is of type quantitative by default, regardless of whether the source data is a DataFrame, URL pointer, CSV file or JSON file.

Functions that handle categorical data (such as count, missing, distinct and valid) are the ones that get the most out of this feature.

Argmin and Argmax Functions

The argmin and argmax functions help you find values from one field that correspond to the minimum or maximum values in another field. For example, you might want to find the production budget of movies that earned the highest gross revenue in each genre.

These functions must be used with the transform_aggregate() method rather than their shorthand notations. They return objects that act as selectors for values in other columns, rather than returning values directly. You can think of the returned object as a dictionary where the column serves as a key to retrieve corresponding values.

To illustrate this, let’s compare the weights of cars with the highest horsepower across different regions of origin:

alt.Chart(cars).mark_bar().encode(
   x='greatest_hp[Weight_in_lbs]:Q',
   y='Origin:N'
).transform_aggregate(
   greatest_hp='argmax(Horsepower)',
   groupby=['Origin']
)

This visualization reveals an interesting contrast: among cars with the highest horsepower in their respective regions, Japanese cars are notably lighter, while American cars are substantially heavier.
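The dictionary-like behaviour of the argmax result can be pictured with a plain-Python sketch. It does not use Altair at all, and the rows are made-up placeholders rather than real values from the cars dataset; it only mimics the idea of keeping one whole row per group and then looking up another column in it.

```python
from collections import defaultdict

# Hypothetical rows standing in for the cars dataset (values are invented).
rows = [
    {'Origin': 'USA', 'Horsepower': 230, 'Weight_in_lbs': 4278},
    {'Origin': 'USA', 'Horsepower': 150, 'Weight_in_lbs': 3436},
    {'Origin': 'Japan', 'Horsepower': 132, 'Weight_in_lbs': 2910},
    {'Origin': 'Japan', 'Horsepower': 88, 'Weight_in_lbs': 2130},
]

# groupby=['Origin']: collect the rows belonging to each origin.
groups = defaultdict(list)
for row in rows:
    groups[row['Origin']].append(row)

# argmax(Horsepower): keep the whole row with the largest Horsepower per group.
greatest_hp = {
    origin: max(members, key=lambda r: r['Horsepower'])
    for origin, members in groups.items()
}

# Counterpart of the 'greatest_hp[Weight_in_lbs]' lookup in the encoding.
weights = {origin: row['Weight_in_lbs'] for origin, row in greatest_hp.items()}
print(weights)  # {'USA': 4278, 'Japan': 2910}
```

Replacing `max` with `min` gives the corresponding picture for argmin.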

See Line Chart with Custom Legend for another example that uses argmax. The argmin function works the same way.

Transform Options

The transform_aggregate() method is built on the AggregateTransform class, which has the following options:

Property	Type	Description
aggregate	array(`AggregatedFieldDef`)	Array of objects that define fields to aggregate.
groupby	array(`FieldName`)	The data fields to group by. If not specified, a single group containing all data objects will be used.

The AggregatedFieldDef objects have the following options:

Property	Type	Description
as	`FieldName`	The output field names to use for each aggregated field.
field	`FieldName`	The data field for which to compute aggregate function. This is required for all aggregation operations except `"count"`.
op	`AggregateOp`	The aggregation operation to apply to the fields (e.g., `"sum"`, `"average"`, or `"count"`). See the full list of supported aggregation operations (https://vega.github.io/vega-lite/docs/aggregate.html#ops) for more information.

Aggregation Functions

In addition to count and average, there are a large number of available aggregation functions built into Altair; they are listed in the following tables:

Count-related Functions

Aggregate	Description	Example
count	The total count of data objects in the group.	Simple Heatmap
valid	The count of field values that are not null or undefined.	N/A
missing	The count of null or undefined field values.	N/A
distinct	The count of distinct field values.	N/A
values	A list of data objects in the group.	N/A
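The difference between the counting functions can be illustrated with a plain-Python sketch over one group of made-up field values, where `None` stands in for null or undefined. Whether null values count toward distinct is not stated in the table above; this sketch excludes them, which is an assumption.

```python
# Hypothetical field values for a single group; None plays the role of null.
values = ['USA', 'Japan', None, 'USA', 'Europe', None]

count = len(values)                                   # every data object
valid = sum(v is not None for v in values)            # non-null field values
missing = sum(v is None for v in values)              # null field values
distinct = len({v for v in values if v is not None})  # unique non-null values (assumption)

print(count, valid, missing, distinct)  # 6 4 2 3
```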

Basic Mathematical Operations

Aggregate	Description	Example
sum	The sum of field values.	Streamgraph
product	The product of field values.	N/A

Central Tendency Measures

Aggregate	Description	Example
mean	The mean (average) field value.	Interactive Scatter Plot and Linked Layered Histogram
average	The mean (average) field value. Identical to mean.	Line Chart with Layered Aggregates
median	The median field value.	Boxplot with Min/Max Whiskers
variance	The sample variance of field values.	N/A
variancep	The population variance of field values.	N/A
stdev	The sample standard deviation of field values.	N/A
stdevp	The population standard deviation of field values.	N/A
stderr	The standard error of the field values.	N/A
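The sample and population variants differ only in the divisor (n - 1 versus n). The sketch below reproduces the distinction with Python's standard `statistics` module on a made-up list of numbers. The standard error line assumes the usual definition, sample standard deviation divided by the square root of n, which the table above does not spell out.

```python
import math
import statistics

# Hypothetical field values; the mean is 5 and the squared deviations sum to 32.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

variance = statistics.variance(values)    # sample variance: 32 / (n - 1)
variancep = statistics.pvariance(values)  # population variance: 32 / n
stdev = statistics.stdev(values)          # square root of the sample variance
stdevp = statistics.pstdev(values)        # square root of the population variance
stderr = stdev / math.sqrt(n)             # assumed definition of standard error

print(variance, variancep, stdev, stdevp, stderr)
```

For large groups the two variants converge, so the choice matters mainly for small groups.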

Distribution Statistics

Aggregate	Description	Example
q1	The lower quartile boundary of values.	Boxplot with Min/Max Whiskers
q3	The upper quartile boundary of values.	Boxplot with Min/Max Whiskers
ci0	The lower boundary of the bootstrapped 95% confidence interval of the mean.	Sorted Error Bars showing Confidence Interval
ci1	The upper boundary of the bootstrapped 95% confidence interval of the mean.	Sorted Error Bars showing Confidence Interval

Range Functions

Aggregate	Description	Example
min	The minimum field value.	Boxplot with Min/Max Whiskers
max	The maximum field value.	Boxplot with Min/Max Whiskers
argmin	An input data object containing the minimum field value.	N/A
argmax	An input data object containing the maximum field value.	Line Chart with Custom Legend