Aggregate#

There are two ways to aggregate data within Altair: within the encoding itself, or using a top-level aggregate transform.

The aggregate property of a field definition can be used to compute aggregate summary statistics (e.g., median, min, max) over groups of data.

If any field in the specified encoding channels contains an aggregate, the resulting visualization will show aggregate data. In this case, all fields without a specified aggregation function are treated as group-by fields in the aggregation process.

For example, the following bar chart shows the mean acceleration, grouped by the number of cylinders.

import altair as alt
from altair.datasets import data

cars = data.cars.url

alt.Chart(cars).mark_bar().encode(
    y='Cylinders:O',
    x='mean(Acceleration):Q',
)

The Altair shorthand string:

# ...
x='mean(Acceleration):Q',
# ...

is made available for convenience, and is equivalent to the longer form:

# ...
x=alt.X(field='Acceleration', aggregate='mean', type='quantitative'),
# ...

For more information on shorthand encoding specifications, see Encoding Shorthands.

The same plot can be shown via an explicitly computed aggregation, using the transform_aggregate() method:

alt.Chart(cars).mark_bar().encode(
    y='Cylinders:O',
    x='mean_acc:Q'
).transform_aggregate(
    mean_acc='mean(Acceleration)',
    groupby=["Cylinders"]
)

An alternative to using aggregate functions is to preprocess the data with pandas and then plot the resulting DataFrame:

cars_df = data.cars()
source = (
    cars_df.groupby('Cylinders')
    .Acceleration
    .mean()
    .reset_index()
    .rename(columns={'Acceleration': 'mean_acc'})
)

alt.Chart(source).mark_bar().encode(
    y='Cylinders:O',
    x='mean_acc:Q'
)

Note

Altair transforms are great for quick exploration, while upfront analysis using dedicated dataframe libraries can be faster for large datasets. See Data Transformers for details.

Because Cylinders is of type int64 in the source DataFrame, Altair would have treated it as a quantitative type, instead of ordinal, had we not specified it. Making the data type explicit is important because it affects the resulting plot; see Effect of Data Type on Color Scales and Effect of Data Type on Axis Scales for two illustrated examples. As a rule of thumb, it is better to state the data type explicitly than to rely on implicit type inference.

Functions Without Arguments#

Some aggregate functions, such as count, can be used without arguments. In such cases, the function operates directly on the input data objects and returns the same value regardless of the provided field.

The following chart demonstrates this by counting the number of cars with respect to their country of origin.

alt.Chart(cars).mark_bar().encode(
    y='Origin:N',
    # shorthand form of alt.X(aggregate='count')
    x='count()'
)

Note

The count aggregate function is of type quantitative by default, regardless of whether the source data is a DataFrame, URL pointer, CSV file or JSON file.

Functions that handle categorical data (such as count, missing, distinct and valid) are the ones that get the most out of this feature.

Argmin and Argmax Functions#

The argmin and argmax functions help you find values from one field that correspond to the minimum or maximum values in another field. For example, you might want to find the production budget of movies that earned the highest gross revenue in each genre.

These functions must be used with the transform_aggregate() method rather than their shorthand notations. They return objects that act as selectors for values in other columns, rather than returning values directly. You can think of the returned object as a dictionary where the column serves as a key to retrieve corresponding values.

To illustrate this, let’s compare the weights of cars with the highest horsepower across different regions of origin:

alt.Chart(cars).mark_bar().encode(
    x='greatest_hp[Weight_in_lbs]:Q',
    y='Origin:N'
).transform_aggregate(
    greatest_hp='argmax(Horsepower)',
    groupby=['Origin']
)

This visualization reveals an interesting contrast: among cars with the highest horsepower in their respective regions, Japanese cars are notably lighter, while American cars are substantially heavier.

See Line Chart with Custom Legend for another example that uses argmax. The argmin function works the same way, selecting the row with the minimum value instead.
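The "selector" behaviour of argmax can be mimicked in pandas, which may help build intuition. This is only an analogy under stated assumptions, not how Altair or Vega-Lite computes the result, and the six rows below are made-up values rather than the real cars data.

```python
import pandas as pd

# Made-up sample rows (illustrative values only).
cars_sample = pd.DataFrame({
    "Origin": ["USA", "USA", "Japan", "Japan", "Europe", "Europe"],
    "Horsepower": [150, 225, 97, 110, 95, 133],
    "Weight_in_lbs": [3436, 4951, 2130, 2720, 2375, 3410],
})

# For each Origin, find the label of the row with the largest Horsepower.
row_labels = cars_sample.groupby("Origin")["Horsepower"].idxmax()

# Keep those whole rows; this plays the role of the object returned by argmax.
greatest_hp = cars_sample.loc[row_labels].set_index("Origin")

# Look up another column of the selected rows, like greatest_hp[Weight_in_lbs].
weights = greatest_hp["Weight_in_lbs"].to_dict()
print(weights)  # {'Europe': 3410, 'Japan': 2720, 'USA': 4951}
```

The key point is that the aggregate keeps one entire row per group, so any of its columns can be read afterwards.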

Transform Options#

The transform_aggregate() method is built on the AggregateTransform class, which has the following options:

| Property | Type | Description |
|---|---|---|
| aggregate | array(AggregatedFieldDef) | Array of objects that define fields to aggregate. |
| groupby | array(FieldName) | The data fields to group by. If not specified, a single group containing all data objects will be used. |

The AggregatedFieldDef objects have the following options:

| Property | Type | Description |
|---|---|---|
| as | FieldName | The output field names to use for each aggregated field. |
| field | FieldName | The data field for which to compute the aggregate function. This is required for all aggregation operations except "count". |
| op | AggregateOp | The aggregation operation to apply to the fields (e.g., "sum", "average", or "count"). See the full list of supported aggregation operations (https://vega.github.io/vega-lite/docs/aggregate.html#ops) for more information. |

Aggregation Functions#

In addition to count and average, Altair provides many other built-in aggregation functions, listed in the following tables:

Basic Mathematical Operations#

| Aggregate | Description | Example |
|---|---|---|
| sum | The sum of field values. | Streamgraph |
| product | The product of field values. | N/A |

Central Tendency Measures#

| Aggregate | Description | Example |
|---|---|---|
| mean | The mean (average) field value. | Interactive Scatter Plot and Linked Layered Histogram |
| average | The mean (average) field value. Identical to mean. | Line Chart with Layered Aggregates |
| median | The median field value. | Boxplot with Min/Max Whiskers |
| variance | The sample variance of field values. | N/A |
| variancep | The population variance of field values. | N/A |
| stdev | The sample standard deviation of field values. | N/A |
| stdevp | The population standard deviation of field values. | N/A |
| stderr | The standard error of the field values. | N/A |
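The difference between the sample and population variants can be illustrated with Python's standard library. This sketch uses the textbook definitions (sample statistics divide by n - 1, population statistics by n, and the standard error is taken as the sample standard deviation divided by the square root of n); it is an assumption that these match the corresponding aggregates, and details such as the handling of missing values are not reproduced.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

# Sample statistics divide by n - 1 (compare "variance" and "stdev").
sample_variance = statistics.variance(values)
sample_stdev = statistics.stdev(values)

# Population statistics divide by n (compare "variancep" and "stdevp").
population_variance = statistics.pvariance(values)
population_stdev = statistics.pstdev(values)

# Textbook standard error of the mean (compare "stderr").
standard_error = sample_stdev / math.sqrt(len(values))

print(sample_variance, population_variance)
print(sample_stdev, population_stdev, standard_error)
```

For small groups the two variants differ noticeably, so it is worth choosing the one that matches how the data were collected.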

Distribution Statistics#

| Aggregate | Description | Example |
|---|---|---|
| q1 | The lower quartile boundary of values. | Boxplot with Min/Max Whiskers |
| q3 | The upper quartile boundary of values. | Boxplot with Min/Max Whiskers |
| ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean. | Sorted Error Bars showing Confidence Interval |
| ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean. | Sorted Error Bars showing Confidence Interval |

Range Functions#

| Aggregate | Description | Example |
|---|---|---|
| min | The minimum field value. | Boxplot with Min/Max Whiskers |
| max | The maximum field value. | Boxplot with Min/Max Whiskers |
| argmin | An input data object containing the minimum field value. | N/A |
| argmax | An input data object containing the maximum field value. | Line Chart with Custom Legend |