Aggregate#
There are two ways to aggregate data in Altair: within the encoding itself, or with a top-level aggregate transform.
The aggregate property of a field definition can be used to compute aggregate
summary statistics (e.g., median, min, max) over groups of data.
If any field in the specified encoding channels contains an aggregate, the resulting visualization will show aggregate data. In this case, all fields without a specified aggregation function are treated as group-by fields in the aggregation process.
For example, the following bar chart shows the mean acceleration,
grouped by the number of cylinders.
import altair as alt
from altair.datasets import data

cars = data.cars.url

alt.Chart(cars).mark_bar().encode(
    y='Cylinders:O',
    x='mean(Acceleration):Q',
)
The Altair shorthand string:
# ...
x='mean(Acceleration):Q',
# ...
is made available for convenience, and is equivalent to the longer form:
# ...
x=alt.X(field='Acceleration', aggregate='mean', type='quantitative'),
# ...
For more information on shorthand encoding specifications, see Encoding Shorthands.
The same plot can be shown via an explicitly computed aggregation, using the
transform_aggregate() method:
alt.Chart(cars).mark_bar().encode(
    y='Cylinders:O',
    x='mean_acc:Q'
).transform_aggregate(
    mean_acc='mean(Acceleration)',
    groupby=["Cylinders"]
)
An alternative to using aggregate functions is to preprocess the data with pandas, and then plot the resulting DataFrame:
cars_df = data.cars()
source = (
    cars_df.groupby('Cylinders')
    .Acceleration
    .mean()
    .reset_index()
    .rename(columns={'Acceleration': 'mean_acc'})
)

alt.Chart(source).mark_bar().encode(
    y='Cylinders:O',
    x='mean_acc:Q'
)
Note
Altair transforms are great for quick exploration, while upfront analysis using dedicated dataframe libraries can be faster for large datasets. See Data Transformers for details.
Because Cylinders is of type int64 in the source
DataFrame, Altair would have treated it as a quantitative type, instead of
ordinal, had we not specified it. Making the data type
explicit is important because it affects the resulting plot; see
Effect of Data Type on Color Scales and Effect of Data Type on Axis Scales for two illustrated
examples. As a rule of thumb, it is better to make the data type explicit
than to rely on implicit type inference.
Functions Without Arguments#
Aggregate functions can be used without arguments. In such cases, the function operates directly on the input objects and returns the same value regardless of the provided field.
The following chart demonstrates this by counting the number of cars with respect to their country of origin.
alt.Chart(cars).mark_bar().encode(
    y='Origin:N',
    # shorthand form of alt.X(aggregate='count')
    x='count()'
)
Note
The count aggregate function is of type quantitative by default,
regardless of whether the source data is a DataFrame, a URL pointer, a CSV file, or a JSON file.
Functions that handle categorical data (such as count,
missing, distinct and valid) are the ones that get
the most out of this feature.
Argmin and Argmax Functions#
The argmin and argmax functions help you find values from
one field that correspond to the minimum or maximum values in another
field. For example, you might want to find the production budget of
movies that earned the highest gross revenue in each genre.
These functions must be used with the transform_aggregate()
method rather than their shorthand notations. They return objects that act
as selectors for values in other columns, rather than returning values
directly. You can think of the returned object as a dictionary where the
column serves as a key to retrieve corresponding values.
To illustrate this, let’s compare the weights of cars with the highest horsepower across different regions of origin:
alt.Chart(cars).mark_bar().encode(
    x='greatest_hp[Weight_in_lbs]:Q',
    y='Origin:N'
).transform_aggregate(
    greatest_hp='argmax(Horsepower)',
    groupby=['Origin']
)
This visualization reveals an interesting contrast: among cars with the highest horsepower in their respective regions, Japanese cars are notably lighter, while American cars are substantially heavier.
See Line Chart with Custom Legend for another example that uses
argmax. The argmin function works the same way, selecting the minimum instead of the maximum.
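The dictionary analogy can be spelled out in plain Python. The sketch below is not how Altair or Vega-Lite implement argmax; it only mimics the behavior described above, using a hypothetical helper `argmax_by_group` and made-up rows.

```python
# Hypothetical rows standing in for the cars dataset (values are made up).
rows = [
    {"Origin": "USA", "Horsepower": 150, "Weight_in_lbs": 3700},
    {"Origin": "USA", "Horsepower": 230, "Weight_in_lbs": 4300},
    {"Origin": "Japan", "Horsepower": 132, "Weight_in_lbs": 2900},
    {"Origin": "Japan", "Horsepower": 97, "Weight_in_lbs": 2100},
]


def argmax_by_group(rows, field, groupby):
    """Return, per group, the whole row whose `field` value is largest."""
    best = {}
    for row in rows:
        key = row[groupby]
        if key not in best or row[field] > best[key][field]:
            best[key] = row
    return best


greatest_hp = argmax_by_group(rows, "Horsepower", "Origin")

# Mirrors the 'greatest_hp[Weight_in_lbs]' lookup in the chart encoding.
weights = {origin: row["Weight_in_lbs"] for origin, row in greatest_hp.items()}
```

Each value in `greatest_hp` is a complete input row, which is why any other column of that row can be looked up afterwards.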
Transform Options#
The transform_aggregate() method is built on the AggregateTransform
class, which has the following options:
| Property | Type | Description |
|---|---|---|
| aggregate | array( | Array of objects that define fields to aggregate. |
| groupby | array( | The data fields to group by. If not specified, a single group containing all data objects will be used. |
The AggregatedFieldDef objects have the following options:
| Property | Type | Description |
|---|---|---|
| as | | The output field names to use for each aggregated field. |
| field | | The data field for which to compute aggregate function. This is required for all aggregation operations except |
| op | | The aggregation operation to apply to the fields (e.g., |
Aggregation Functions#
Besides count and average, Altair has many built-in
aggregation functions; they are listed in the following tables:
Basic Mathematical Operations#
| Aggregate | Description | Example |
|---|---|---|
| sum | The sum of field values. | |
| product | The product of field values. | N/A |
Central Tendency Measures#
| Aggregate | Description | Example |
|---|---|---|
| mean | The mean (average) field value. | |
| average | The mean (average) field value. Identical to mean. | |
| median | The median field value. | |
| variance | The sample variance of field values. | N/A |
| variancep | The population variance of field values. | N/A |
| stdev | The sample standard deviation of field values. | N/A |
| stdevp | The population standard deviation of field values. | N/A |
| stderr | The standard error of the field values. | N/A |
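The difference between the sample and population variants is easiest to see on a small worked example. The sketch below uses Python's standard `statistics` module as a stand-in, not Altair itself, and it assumes the usual definition of the standard error as the sample standard deviation divided by the square root of the number of values.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean is 5.0

# Sample statistics divide the summed squared deviations (32) by n - 1.
variance = statistics.variance(values)   # 32 / 7
stdev = statistics.stdev(values)

# Population statistics divide by n instead.
variancep = statistics.pvariance(values)  # 32 / 8 = 4.0
stdevp = statistics.pstdev(values)        # 2.0

# Assumed definition: sample standard deviation over sqrt(n).
stderr = stdev / math.sqrt(len(values))
```

For small groups the two variants differ noticeably, so it is worth choosing deliberately between stdev and stdevp.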
Distribution Statistics#
| Aggregate | Description | Example |
|---|---|---|
| q1 | The lower quartile boundary of values. | |
| q3 | The upper quartile boundary of values. | |
| ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean. | |
| ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean. | |
Range Functions#
| Aggregate | Description | Example |
|---|---|---|
| min | The minimum field value. | |
| max | The maximum field value. | |
| argmin | An input data object containing the minimum field value. | N/A |
| argmax | An input data object containing the maximum field value. | |