Exploring a Dataset¶

Now that we have seen the basic pieces of Altair’s API, it’s time to practice using it to explore a new dataset. With your partner, choose one of the four datasets described below.

As you explore the data, recall the building blocks we’ve discussed:

various marks: mark_point(), mark_line(), mark_tick(), mark_bar(), mark_area(), mark_rect(), etc.
various encodings: x, y, color, shape, size, row, column, text, tooltip, etc.
binning and aggregations: a list of available aggregations can be found in Altair’s documentation
layering and concatenation: alt.layer (the + operator), alt.hconcat (the | operator), alt.vconcat (the & operator)

Start simple and build from there. Which encodings work best with quantitative data? With categorical data? What can you learn about your dataset using these tools?

We’ll set aside about 20 minutes for you to work on this with your partner.

from vega_datasets import data

Seattle Weather¶

This dataset records precipitation, temperature range, wind speed, and weather type in Seattle for each day from 2012 to 2015.

weather = data.seattle_weather()
weather.head()

	date	precipitation	temp_max	temp_min	wind	weather
0	2012-01-01	0.0	12.8	5.0	4.7	drizzle
1	2012-01-02	10.9	10.6	2.8	4.5	rain
2	2012-01-03	0.8	11.7	7.2	2.3	rain
3	2012-01-04	20.3	12.2	5.6	4.7	rain
4	2012-01-05	1.3	8.9	2.8	6.1	rain

Gapminder¶

This data consists of population, fertility, and life expectancy over time in a number of countries around the world.

Note that although it is tempting to use a temporal encoding for the year, the year here is a plain number rather than a date stamp, so an ordinal or quantitative encoding is usually a better fit than a temporal one.

gapminder = data.gapminder()
gapminder.head()

	year	country	pop	life_expect	fertility
0	1955	Afghanistan	8891209	30.332	7.7
1	1960	Afghanistan	9829450	31.997	7.7
2	1965	Afghanistan	10997885	34.020	7.7
3	1970	Afghanistan	12430623	36.088	7.7
4	1975	Afghanistan	14132019	38.438	7.7

Population¶

This dataset contains the US population, broken down by age and sex, for every decade from 1850 to near the present.

Note that although it is tempting to use a temporal encoding for the year, the year here is a plain number rather than a date stamp, so an ordinal or quantitative encoding is usually a better fit than a temporal one.

population = data.population()
population.head()

	year	age	sex	people
0	1850	0	1	1483789
1	1850	0	2	1450376
2	1850	5	1	1411067
3	1850	5	2	1359668
4	1850	10	1	1260099

Movies¶

The movies dataset has data on 3200 movies, including release date, budget, and ratings on IMDB and Rotten Tomatoes.

movies = data.movies()
movies.head()

	Title	US_Gross	Worldwide_Gross	US_DVD_Sales	Production_Budget	Release_Date	MPAA_Rating	Running_Time_min	Distributor	Source	Major_Genre	Creative_Type	Director	Rotten_Tomatoes_Rating	IMDB_Rating	IMDB_Votes
0	The Land Girls	146083.0	146083.0	NaN	8000000.0	Jun 12 1998	R	NaN	Gramercy	None	None	None	None	NaN	6.1	1071.0
1	First Love, Last Rites	10876.0	10876.0	NaN	300000.0	Aug 07 1998	R	NaN	Strand	None	Drama	None	None	NaN	6.9	207.0
2	I Married a Strange Person	203134.0	203134.0	NaN	250000.0	Aug 28 1998	None	NaN	Lionsgate	None	Comedy	None	None	NaN	6.8	865.0
3	Let's Talk About Sex	373615.0	373615.0	NaN	300000.0	Sep 11 1998	None	NaN	Fine Line	None	Comedy	None	None	13.0	NaN	NaN
4	Slam	1009819.0	1087521.0	NaN	1000000.0	Oct 09 1998	R	NaN	Trimark	Original Screenplay	Drama	Contemporary Fiction	None	62.0	3.4	165.0
