Flatten Transform

The flatten transform can be used to extract the contents of arrays from data entries. This will not generally be useful for well-structured data within pandas dataframes, but it can be useful for working with data from other sources.

As an example, consider this dataset, which follows a common convention in JSON data, where each record has a field containing a list of entries:

import numpy as np

rand = np.random.RandomState(0)

def generate_data(N):
    mean = rand.randn()
    std = rand.rand()
    return list(rand.normal(mean, std, N))

data = [
    {'label': 'A', 'values': generate_data(20)},
    {'label': 'B', 'values': generate_data(30)},
    {'label': 'C', 'values': generate_data(40)},
    {'label': 'D', 'values': generate_data(50)},
]

This kind of data structure does not work well in the context of dataframe representations, as we can see by loading this into pandas:

import pandas as pd
df = pd.DataFrame.from_records(data)
df
      label                                             values
    0     A  [2.005252455842496, 0.3967871813856627, 2.5678...
    1     B  [1.1906228762083413, -1.6927165224630425, -0.5...
    2     C  [0.3901956756272385, 1.4135072065946024, 0.603...
    3     D  [1.0035211072316703, 1.1414240499680273, 1.883...

Altair’s flatten transform allows you to extract the contents of these arrays into a column that can be referenced by an encoding:

import altair as alt


This can be particularly useful in cleaning up data specified via a JSON URL, without having to first load the data for manipulation in pandas.

Transform Options

The transform_flatten() method is built on the FlattenTransform class, which has the following options:

as
    The output field names for extracted array values.

    Default value: The field name of the corresponding array field

flatten
    An array of one or more data fields containing arrays to flatten. If multiple fields are specified, their array values should have a parallel structure, ideally with the same length. If the lengths of parallel arrays do not match, the longest array will be used with null values added for missing entries.
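
The padding rule for parallel arrays can be illustrated without Altair. The helper below, flatten_records, is a hypothetical pure-Python sketch of the behavior described above, not Altair's or Vega-Lite's implementation: each record is expanded to as many rows as its longest listed array, and shorter arrays are padded with None.

```python
from itertools import zip_longest

def flatten_records(records, fields):
    """Expand each record into one row per array element (illustrative only)."""
    rows = []
    for record in records:
        # zip_longest pads the shorter arrays with None, mirroring the
        # null-padding described for arrays of unequal length.
        for elements in zip_longest(*(record[f] for f in fields)):
            # Keep the non-flattened fields, then overwrite the flattened ones.
            row = {k: v for k, v in record.items() if k not in fields}
            row.update(zip(fields, elements))
            rows.append(row)
    return rows

rows = flatten_records(
    [{'label': 'A', 'x': [1, 2, 3], 'y': [10, 20]}],
    ['x', 'y'],
)
for row in rows:
    print(row)
```

Here the x array has three elements and y has two, so the third row carries None for y.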