{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Demo: Exploring the Cars Dataset\n", "\n", "We'll start this tutorial with a demo to whet your appetite for learning more. This section purposely moves quickly through many of the concepts (e.g. data, marks, encodings, aggregation, data types, selections, etc.).\n", "We will return to treat each of these in more depth later in the tutorial, so don't worry if it all seems to go a bit quickly!\n", "\n", "In the tutorial itself, this will be done from scratch in a blank notebook.\n", "However, for the sake of people who want to look back on what we did live, I'll do my best to reproduce the examples and the discussion here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Imports and Data\n", "\n", "We'll start by importing the [Altair package](http://altair-viz.github.io/):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import altair as alt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll use the [vega_datasets package](https://github.com/altair-viz/vega_datasets) to load an example dataset:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
0chevrolet chevelle malibu18.08307.0130.0350412.01970-01-01USA
1buick skylark 32015.08350.0165.0369311.51970-01-01USA
2plymouth satellite18.08318.0150.0343611.01970-01-01USA
3amc rebel sst16.08304.0150.0343312.01970-01-01USA
4ford torino17.08302.0140.0344910.51970-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from vega_datasets import data\n", "\n", "cars = data.cars()\n", "cars.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this data is in columnar format: that is, each column contains an attribute of a data point, and each row contains a single instance of the data (here, a single make & model of car)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Zero, One, and Two-dimensional Charts\n", "\n", "Using Altair, we can begin to explore this data.\n", "\n", "The most basic chart contains the dataset, along with a mark to represent each row:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a pretty silly chart, because it consists of 406 points, all laid out on top of each other.\n", "\n", "To make it more interesting, we need to *encode* columns of the data into visual features of the plot (e.g. x position, y position, size, color, etc.).\n", "\n", "Let's encode miles per gallon on the x-axis using the ``encode()`` method:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a bit better, but the ``point`` mark is probably not the best for a 1D chart like this.\n", "\n", "Let's try the ``tick`` mark instead:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_tick().encode(\n", " x='Miles_per_Gallon'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can expand this into a 2D chart by also encoding the y value. We'll return to the ``point`` mark and put ``Horsepower`` on the y-axis:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Simple Interactions\n", "\n", "One of the nicest features of Altair is the grammar of interaction that it provides.\n", "The simplest kind of interaction is the ability to pan and zoom within a chart; Altair contains a shortcut to enable this via the ``interactive()`` method:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower'\n", ").interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This lets you click and drag, as well as use your computer's scroll/zoom behavior to zoom in and out on the chart.\n", "\n", "We'll see other interactions later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. A Third Dimension: Color" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A 2D plot allows us to encode two dimensions of the data. Let's look at using color to encode a third:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that when we use a categorical value for color, it chooses an appropriate color map for categorical data.\n", "\n", "Let's see what happens when we use a continuous color value:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Acceleration'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A continuous color results in a color scale that is appropriate for continuous data.\n", "\n", "What about the in-between case: ordered categories, like number of cylinders?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Cylinders'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Altair still chooses a continuous color scale because the ``Cylinders`` column is numerical.\n", "\n", "We can improve this by specifying that the data should be treated as a discrete ordered value; we can do this by appending ``\":O\"`` (\"O\" for \"ordinal\" or \"ordered categories\") to the field name:\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Cylinders:O'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we get a discrete legend with an ordered color mapping." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Binning and aggregation\n", "\n", "Let's return quickly to our 1D chart of miles per gallon:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_tick().encode(\n", " x='Miles_per_Gallon',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way we might represent this data is to create a histogram: to bin the x data and show the count on the y axis.\n", "In many plotting libraries this is done with a special method like ``hist()``. In Altair, such binning and aggregation is part of the declarative API.\n", "\n", "To move beyond a simple field name, we use ``alt.X()`` for the x encoding, and we use ``'count()'`` for the y encoding:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", " x=alt.X('Miles_per_Gallon', bin=True),\n", " y='count()'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want more control over the bins, we can use ``alt.Bin`` to adjust bin parameters" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", " x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),\n", " y='count()'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we apply another encoding (such as ``color``), the data will be automatically grouped within each bin:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", " x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),\n", " y='count()',\n", " color='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you prefer a separate plot for each category, the ``column`` encoding can help:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_bar().encode(\n", " x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=30)),\n", " y='count()',\n", " color='Origin',\n", " column='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Binning and aggregation work in two dimensions as well; we can use the ``rect`` mark and visualize the count using color:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_rect().encode(\n", " x=alt.X('Miles_per_Gallon', bin=True),\n", " y=alt.Y('Horsepower', bin=True),\n", " color='count()'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregations can be more than simple counts; we can also aggregate and compute the mean of a third quantity within each bin" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_rect().encode(\n", " x=alt.X('Miles_per_Gallon', bin=True),\n", " y=alt.Y('Horsepower', bin=True),\n", " color='mean(Weight_in_lbs)'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Time-Series & Layering\n", "\n", "So far we've been ignoring the ``Year`` column, but it's interesting to see how quantities such as miles per gallon trend over time:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Year',\n", " y='Miles_per_Gallon'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each year has a number of cars, and a lot of overlap in the data.\n", "We can clean this up a bit by plotting the mean at each x value:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_line().encode(\n", " x='Year',\n", " y='mean(Miles_per_Gallon)',\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can change the mark to ``area`` and use the ``ci0`` and ``ci1`` aggregates to plot the confidence interval of the estimate of the mean:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_area().encode(\n", " x='Year',\n", " y='ci0(Miles_per_Gallon)',\n", " y2='ci1(Miles_per_Gallon)'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's adjust this chart a bit: add some opacity, color by the country of origin, widen the chart, and add a cleaner axis title:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_area(opacity=0.3).encode(\n", " x=alt.X('Year', timeUnit='year'),\n", " y=alt.Y('ci0(Miles_per_Gallon)', axis=alt.Axis(title='Miles per Gallon')),\n", " y2='ci1(Miles_per_Gallon)',\n", " color='Origin'\n", ").properties(\n", " width=800\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can use Altair's layering API to layer a line chart representing the mean on top of the area chart representing the confidence interval:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spread = alt.Chart(cars).mark_area(opacity=0.3).encode(\n", " x=alt.X('Year', timeUnit='year'),\n", " y=alt.Y('ci0(Miles_per_Gallon)', axis=alt.Axis(title='Miles per Gallon')),\n", " y2='ci1(Miles_per_Gallon)',\n", " color='Origin'\n", ").properties(\n", " width=800\n", ")\n", "\n", "lines = alt.Chart(cars).mark_line().encode(\n", " x=alt.X('Year', timeUnit='year'),\n", " y='mean(Miles_per_Gallon)',\n", " color='Origin'\n", ").properties(\n", " width=800\n", ")\n", "\n", "spread + lines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Interactivity: Selections\n", "\n", "Let's return to our scatter plot, and take a look at the other types of interactivity that Altair offers:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Origin'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that you can add ``interactive()`` to the end of a chart to enable the most basic interactive scales:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Origin'\n", ").interactive()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Altair provides a general ``selection`` API for creating interactive plots; for example, here we create an interval selection:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interval = alt.selection_interval()\n", "\n", "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color='Origin'\n", ").add_selection(\n", " interval\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently this selection doesn't actually do anything, but we can change that by conditioning the color on this selection:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interval = alt.selection_interval()\n", "\n", "alt.Chart(cars).mark_point().encode(\n", " x='Miles_per_Gallon',\n", " y='Horsepower',\n", " color=alt.condition(interval, 'Origin', alt.value('lightgray'))\n", ").add_selection(\n", " interval\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The nice thing about this selection API is that it *automatically* applies across any compound charts; for example, here we can horizontally concatenate two charts, and since they both have the same selection they both respond appropriately:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.HConcatChart(...)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interval = alt.selection_interval()\n", "\n", "base = alt.Chart(cars).mark_point().encode(\n", " y='Horsepower',\n", " color=alt.condition(interval, 'Origin', alt.value('lightgray')),\n", " tooltip='Name'\n", ").add_selection(\n", " interval\n", ")\n", "\n", "base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do even more sophisticated things with selections as well.\n", "For example, let's make a bar chart of the number of selected cars by Origin, and stack it below our scatter plots:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.VConcatChart(...)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "interval = alt.selection_interval()\n", "\n", "base = alt.Chart(cars).mark_point().encode(\n", " y='Horsepower',\n", " color=alt.condition(interval, 'Origin', alt.value('lightgray')),\n", " tooltip='Name'\n", ").add_selection(\n", " interval\n", ")\n", "\n", "hist = alt.Chart(cars).mark_bar().encode(\n", " x='count()',\n", " y='Origin',\n", " color='Origin'\n", ").properties(\n", " width=800,\n", " height=80\n", ").transform_filter(\n", " interval\n", ")\n", "\n", "scatter = base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')\n", "\n", "scatter & hist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This demo has covered a number of the available components of Altair.\n", "In the following sections, we'll look into each of these a bit more systematically." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }