Flatten

The flatten transform can be used to extract the contents of arrays from data entries. This will not generally be useful for well-structured data within pandas dataframes, but it can be useful for working with data from other sources.

As an example, consider this dataset, which follows a common convention in JSON data where a field in each record contains a list of entries:

import numpy as np

rand = np.random.RandomState(0)

def generate_data(N):
    mean = rand.randn()
    std = rand.rand()
    return list(rand.normal(mean, std, N))

data = [
    {'label': 'A', 'values': generate_data(20)},
    {'label': 'B', 'values': generate_data(30)},
    {'label': 'C', 'values': generate_data(40)},
    {'label': 'D', 'values': generate_data(50)},
]

This kind of data structure does not work well in the context of dataframe representations, as we can see by loading this into pandas:

import pandas as pd
df = pd.DataFrame.from_records(data)
df

  label                                             values
0     A  [2.005252455842496, 0.3967871813856627, 2.5678...
1     B  [1.1906228762083413, -1.6927165224630425, -0.5...
2     C  [0.3901956756272385, 1.4135072065946024, 0.603...
3     D  [1.0035211072316703, 1.1414240499680273, 1.883...
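
For comparison, reshaping such a column into one row per value is possible on the pandas side as well. The following is a minimal sketch using `DataFrame.explode`; the small `records` list and the names `nested` and `long_df` are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Illustrative records with the same shape as the dataset above:
# one label per record, plus a list of values.
records = [
    {"label": "A", "values": [1.0, 2.0, 3.0]},
    {"label": "B", "values": [4.0, 5.0]},
]

nested = pd.DataFrame.from_records(records)

# explode() emits one row per list element and repeats the other columns.
long_df = nested.explode("values").reset_index(drop=True)

# The exploded column keeps object dtype, so convert it explicitly.
long_df["values"] = long_df["values"].astype(float)

print(long_df)
```

This requires loading the data into pandas first, which is the step the flatten transform lets you skip.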

Altair’s flatten transform allows you to extract the contents of these arrays into a column that can be referenced by an encoding:

import altair as alt

alt.Chart(df).transform_flatten(
    ['values']
).mark_tick().encode(
    x='values:Q',
    y='label:N',
)

This can be particularly useful in cleaning up data specified via a JSON URL, without having to first load the data for manipulation in pandas.

Transform Options

The transform_flatten() method is built on the FlattenTransform class, which has the following options:

Property   Type               Description

as         array(FieldName)   The output field names for extracted array values. Default value: The field name of the corresponding array field

flatten    array(FieldName)   An array of one or more data fields containing arrays to flatten. If multiple fields are specified, their array values should have a parallel structure, ideally with the same length. If the lengths of parallel arrays do not match, the longest array will be used with null values added for missing entries.
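
To make the parallel-array rule concrete, here is a rough pure-Python model of the behavior described above. It is a sketch for intuition only, not Vega-Lite's implementation: `flatten_records` is a hypothetical helper, and requiring `as_` to name every flattened field is a simplification made for this sketch.

```python
from itertools import zip_longest


def flatten_records(records, flatten, as_=None):
    """Rough model of the flatten options (hypothetical helper)."""
    # By default, each output field reuses the name of its array field.
    out_names = list(as_) if as_ is not None else list(flatten)
    if len(out_names) != len(flatten):
        # Simplification for this sketch: one output name per flattened field.
        raise ValueError("as_ must name every flattened field")

    result = []
    for record in records:
        arrays = [record[field] for field in flatten]
        # zip_longest walks the arrays in parallel and pads the shorter ones
        # with None, modeling "null values added for missing entries".
        for values in zip_longest(*arrays, fillvalue=None):
            row = dict(record)  # copy the remaining fields unchanged
            row.update(zip(out_names, values))
            result.append(row)
    return result


records = [{"key": "a", "x": [1, 2, 3], "y": ["p", "q"]}]
rows = flatten_records(records, ["x", "y"], as_=["x_val", "y_val"])
for row in rows:
    print(row["key"], row["x_val"], row["y_val"])
```

The third output row pairs `x_val = 3` with `y_val = None`, because `y` is the shorter array. In Altair, the corresponding call would be `transform_flatten(['x', 'y'], as_=['x_val', 'y_val'])`.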