Exploring Seattle Weather
(This tutorial is adapted from Vega-Lite’s documentation)
In this tutorial, you’ll learn a few more techniques for creating visualizations in Altair. If you are not familiar with Altair, please read Basic Statistical Visualization first.
For this tutorial, we will create visualizations to explore weather data for Seattle, taken from NOAA. The dataset is a CSV file with columns for the temperature (in degrees Celsius), precipitation (in millimeters), wind speed (in meters per second), and weather type. There is one row for each day from January 1, 2012 to December 31, 2015.
Altair is designed to work with data in the form of pandas DataFrames, and the companion vega_datasets package provides a loader for this and other built-in datasets:
from vega_datasets import data
df = data.seattle_weather()
df.head()
         date  precipitation  temp_max  temp_min  wind  weather
0  2012-01-01            0.0      12.8       5.0   4.7  drizzle
1  2012-01-02           10.9      10.6       2.8   4.5     rain
2  2012-01-03            0.8      11.7       7.2   2.3     rain
3  2012-01-04           20.3      12.2       5.6   4.7     rain
4  2012-01-05            1.3       8.9       2.8   6.1     rain
The data is loaded from the web and stored in a pandas DataFrame, and from here we can explore it with Altair.
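Before plotting, it can help to confirm the column types. The sketch below does not call the loader. Instead it rebuilds the five rows shown above from an inline CSV string (an assumption, so that the example runs offline) and parses the date column as datetimes, which is the form the temporal encodings later in this tutorial work with. The names csv_text and sample are illustrative.

```python
import io

import pandas as pd

# The five rows printed by df.head() above, re-typed as CSV text.
# Using an inline string is an assumption so the sketch runs offline.
csv_text = """date,precipitation,temp_max,temp_min,wind,weather
2012-01-01,0.0,12.8,5.0,4.7,drizzle
2012-01-02,10.9,10.6,2.8,4.5,rain
2012-01-03,0.8,11.7,7.2,2.3,rain
2012-01-04,20.3,12.2,5.6,4.7,rain
2012-01-05,1.3,8.9,2.8,6.1,rain
"""

# parse_dates converts the date strings into datetime64 values.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Show the inferred column types and the date range covered.
print(sample.dtypes)
print(sample["date"].min(), "to", sample["date"].max())
```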
Let’s start by looking at the precipitation, using tick marks to see the distribution of precipitation values:
import altair as alt
alt.Chart(df).mark_tick().encode(
    x='precipitation',
)
It looks as though precipitation is skewed towards lower values; that is, when it rains in Seattle, it usually doesn’t rain very much. Patterns in a continuous variable are hard to read from tick marks alone, so to see this more clearly we can create a histogram of the precipitation data. To do so, we first discretize the precipitation values by adding binning to the x channel. We then encode the y channel with count(). The result is a histogram of precipitation values:
alt.Chart(df).mark_bar().encode(
    alt.X('precipitation').bin(),
    y='count()'
)
Next, let’s look at how precipitation in Seattle changes throughout the year.
Altair natively supports dates and discretization of dates when we set the type to temporal (shorthand T). For example, in the following plot, we compute the average precipitation for each month. To discretize the data into months, we can use a month binning (see TimeUnit for more information about this and other timeUnit binnings):
alt.Chart(df).mark_line().encode(
    x='month(date):T',
    y='average(precipitation)'
)
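Note that month(date) groups by calendar month alone, so every January across the four years is pooled into a single point. A rough pandas equivalent of that aggregation, on hypothetical toy values rather than the real dataset, might look like this:

```python
import pandas as pd

# Toy data: two Januaries and two Julys from different years (made-up values).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-10", "2013-01-20", "2012-07-05", "2013-07-15"]),
    "precipitation": [10.0, 6.0, 1.0, 0.0],
})

# Group by calendar month only, ignoring the year, then average.
monthly_mean = toy.groupby(toy["date"].dt.month)["precipitation"].mean()
print(monthly_mean)
```

Here the two January rows are averaged together even though they come from different years, which mirrors what the chart does.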
This chart shows that in Seattle the precipitation in the winter is, on average, much higher than in the summer (an unsurprising observation to those who live there!). By changing the mapping of encoding channels to data features, you can begin to explore the relationships within the data.
When looking at precipitation and temperature, we might want to aggregate by year and month (yearmonth) rather than just month. This lets us see seasonal trends across the whole period, with daily variation smoothed out. We might also wish to see the maximum and minimum temperature in each month; here we start with the maximum:
alt.Chart(df).mark_line().encode(
    x='yearmonth(date):T',
    y='max(temp_max)',
)
In this chart, it looks as though the maximum temperature is increasing from year to year over the course of this relatively short baseline. To look more closely at this, let’s instead plot the mean of the maximum daily temperatures for each year:
alt.Chart(df).mark_line().encode(
    x='year(date):T',
    y='mean(temp_max)',
)
This can be a little clearer if we use a bar plot and mark the year as an “ordinal” (ordered category) type. For aesthetic reasons, let’s make the bar chart horizontal by assigning the ordinal value to the y-axis:
alt.Chart(df).mark_bar().encode(
    x='mean(temp_max)',
    y='year(date):O'
)