Specifying Data in Altair

Each top-level chart object (i.e. Chart, LayerChart, VConcatChart, HConcatChart, RepeatChart, and FacetChart) accepts a dataset as its first argument. The dataset can be specified in one of three ways:

  • as a Pandas DataFrame
  • as a Data object
  • as a URL string pointing to a JSON or CSV file

For example, here we specify data via a DataFrame:

import altair as alt
import pandas as pd

data = pd.DataFrame({'x': ['A', 'B', 'C', 'D', 'E'],
                     'y': [5, 3, 6, 7, 2]})

alt.Chart(data).mark_bar().encode(
    x='x',  # types are inferred from the DataFrame
    y='y'
)

When data is specified as a DataFrame, the encoding is quite simple, as Altair uses the data type information provided by Pandas to automatically determine the data types required in the encoding.

By comparison, here we create the same chart using a Data object, with the data specified as a JSON-style list of records:

import altair as alt

data = alt.Data(values=[{'x': 'A', 'y': 5},
                        {'x': 'B', 'y': 3},
                        {'x': 'C', 'y': 6},
                        {'x': 'D', 'y': 7},
                        {'x': 'E', 'y': 2}])

alt.Chart(data).mark_bar().encode(
    x='x:O',  # specify ordinal data
    y='y:Q',  # specify quantitative data
)

Notice the extra markup required in the encoding: because Altair cannot infer the types within a Data object, we must specify them manually. Here we use Encoding Shorthands to specify ordinal (O) for x and quantitative (Q) for y; see Encoding Data Types.

Similarly, we must also specify the data type when referencing data by URL:

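import altair as alt
from vega_datasets import data
url = data.cars.url

# illustrative encoding; both columns exist in the cars dataset
alt.Chart(url).mark_point().encode(x='Horsepower:Q', y='Miles_per_Gallon:Q')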

We will further discuss encodings and associated types in Encodings, next.

Including Index Data

By design Altair only accesses dataframe columns, not dataframe indices. At times, relevant data appears in the index. For example:

import numpy as np
rand = np.random.RandomState(0)

data = pd.DataFrame({'value': rand.randn(100).cumsum()},
                    index=pd.date_range('2018', freq='D', periods=100))
data.head()
                   value
    2018-01-01  1.764052
    2018-01-02  2.164210
    2018-01-03  3.142948
    2018-01-04  5.383841
    2018-01-05  7.251399

If you would like the index to be available to the chart, you can explicitly turn it into a column using the reset_index() method of Pandas dataframes:


If the index object does not have a name attribute set, the resulting column will be called "index". More information is available in the Pandas documentation.
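
The naming behavior can be checked in Pandas alone. This is a small sketch that rebuilds the dataframe from above; the index name "date" is a hypothetical choice made for illustration.

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(0)
data = pd.DataFrame({'value': rand.randn(100).cumsum()},
                    index=pd.date_range('2018', freq='D', periods=100))

# Unnamed index: reset_index() adds a column called "index"
flat = data.reset_index()
print(list(flat.columns))

# Named index: the new column takes the name of the index instead
named = data.rename_axis('date').reset_index()
print(list(named.columns))
```

Either result can then be passed to a chart, with the former index referenced like any other column (for example x='index:T' for the unnamed case).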

Long-form vs. Wide-form Data

There are two common conventions for storing data in a dataframe, sometimes called long-form and wide-form. Both are sensible patterns for storing data in a tabular format; briefly, the difference is this:

  • wide-form data has one row per independent variable, with metadata recorded in the row and column labels.
  • long-form data has one row per observation, with metadata recorded within the table as values.

Altair’s grammar works best with long-form data, in which each row corresponds to a single observation along with its metadata.

A concrete example will help make this distinction clearer. Consider a dataset consisting of stock prices of several companies over time. The wide-form version of the data might be arranged as follows:

wide_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01'],
                          'AAPL': [189.95, 182.22, 198.08],
                          'AMZN': [89.15, 90.56, 92.64],
                          'GOOG': [707.00, 693.00, 691.48]})
             Date    AAPL   AMZN    GOOG
    0  2007-10-01  189.95  89.15  707.00
    1  2007-11-01  182.22  90.56  693.00
    2  2007-12-01  198.08  92.64  691.48

Notice that each row corresponds to a single timestamp (here time is the independent variable), while metadata for each observation (i.e. company name) is stored within the column labels.

The long-form version of the same data might look like this:

long_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01'],
                          'company': ['AAPL', 'AAPL', 'AAPL',
                                      'AMZN', 'AMZN', 'AMZN',
                                      'GOOG', 'GOOG', 'GOOG'],
                          'price': [189.95, 182.22, 198.08,
                                     89.15,  90.56,  92.64,
                                    707.00, 693.00, 691.48]})
             Date company   price
    0  2007-10-01    AAPL  189.95
    1  2007-11-01    AAPL  182.22
    2  2007-12-01    AAPL  198.08
    3  2007-10-01    AMZN   89.15
    4  2007-11-01    AMZN   90.56
    5  2007-12-01    AMZN   92.64
    6  2007-10-01    GOOG  707.00
    7  2007-11-01    GOOG  693.00
    8  2007-12-01    GOOG  691.48

Notice here that each row contains a single observation (i.e. price), along with the metadata for this observation (the date and company name). Importantly, the column and index labels no longer contain any useful metadata.

As mentioned above, Altair works best with this long-form data, because relevant data and metadata are stored within the table itself, rather than within the labels of rows and columns:


Wide-form data can be similarly visualized using e.g. layering (see Layered Charts), but it is far less convenient within Altair’s grammar.

Converting Between Long-form and Wide-form

Conversion between wide-form and long-form data is not part of the Altair schema, and must be done as an external preprocessing step. In Python, this kind of data manipulation can be done using Pandas, as discussed in detail in the Reshaping and Pivot Tables section of the Pandas documentation.

For converting wide-form data to the long-form data used by Altair, the melt method of dataframes can be used. The first argument to melt is the column or list of columns to treat as index variables; the remaining columns will be combined into an indicator variable and a value variable whose names can be optionally specified:

wide_form.melt('Date', var_name='company', value_name='price')
             Date company   price
    0  2007-10-01    AAPL  189.95
    1  2007-11-01    AAPL  182.22
    2  2007-12-01    AAPL  198.08
    3  2007-10-01    AMZN   89.15
    4  2007-11-01    AMZN   90.56
    5  2007-12-01    AMZN   92.64
    6  2007-10-01    GOOG  707.00
    7  2007-11-01    GOOG  693.00
    8  2007-12-01    GOOG  691.48

For more information on the melt method, see the Pandas melt documentation.
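
As a brief sketch of the optional arguments, the example below repeats the call in keyword form, shows the default column names, and restricts the melted columns with value_vars. The variable names default_names and subset are hypothetical.

```python
import pandas as pd

wide_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01'],
                          'AAPL': [189.95, 182.22, 198.08],
                          'AMZN': [89.15, 90.56, 92.64],
                          'GOOG': [707.00, 693.00, 691.48]})

# Keyword form of the call shown above
long_form = wide_form.melt(id_vars='Date', var_name='company',
                           value_name='price')

# Without var_name and value_name, Pandas uses "variable" and "value"
default_names = wide_form.melt('Date')
print(list(default_names.columns))

# value_vars limits which columns are melted; the others are dropped
subset = wide_form.melt('Date', value_vars=['AAPL', 'GOOG'],
                        var_name='company', value_name='price')
print(subset.shape)
```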

To undo this operation and convert from long-form back to wide-form, use the pivot method of dataframes:

wide_form = long_form.pivot(index='Date', columns='company', values='price')
    company       AAPL   AMZN    GOOG
    Date
    2007-10-01  189.95  89.15  707.00
    2007-11-01  182.22  90.56  693.00
    2007-12-01  198.08  92.64  691.48

For more information on the pivot method, see the Pandas pivot documentation.
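
Note that pivot places the Date values in the index rather than in a column. The sketch below shows one way to recover a flat table with an ordinary Date column and to confirm the round trip; the names wide, wide_flat, and melted are hypothetical.

```python
import pandas as pd

long_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01'],
                          'company': ['AAPL', 'AAPL', 'AAPL',
                                      'AMZN', 'AMZN', 'AMZN',
                                      'GOOG', 'GOOG', 'GOOG'],
                          'price': [189.95, 182.22, 198.08,
                                     89.15,  90.56,  92.64,
                                    707.00, 693.00, 691.48]})

wide = long_form.pivot(index='Date', columns='company', values='price')

# Move Date back into a column and drop the "company" label on the columns
wide_flat = wide.reset_index()
wide_flat.columns.name = None
print(list(wide_flat.columns))

# Melting the flat table again gives back the long-form values
melted = wide_flat.melt('Date', var_name='company', value_name='price')
```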