Simple Visualizations with data.vgplot
The core interface of pdvega is the vgplot attribute that it adds to Pandas DataFrame and Series objects:
import pdvega  # importing pdvega adds the vgplot attribute to DataFrame and Series
As with the plot attribute that is built into Pandas, there are two ways of creating plots with vgplot. First, you can call the vgplot attribute of a Pandas object directly, passing the plot type as the kind argument:
from vega_datasets import data
iris = data.iris()
iris.vgplot(kind='scatter', x='sepalLength', y='petalLength', c='species')
Equivalently, you can call the specific method associated with each plot type:
iris.vgplot.scatter(x='sepalLength', y='petalLength', c='species')
The benefit of the second approach is that it allows exploration of the available plot types via tab completion. The individual methods also provide more detailed documentation of the arguments available for each plot type.
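The equivalence of the two call styles can be pictured as a simple dispatch from the kind string to a method of the same name. The following is a minimal sketch of that pattern only, not pdvega's actual implementation; the class ToyPlotMethods and the tuples it returns are hypothetical stand-ins.

```python
class ToyPlotMethods:
    """Hypothetical stand-in for a plotting accessor; not pdvega's real code."""

    def __init__(self, data):
        self._data = data

    def __call__(self, kind="line", **kwargs):
        # Look up the method whose name matches `kind` and forward the arguments.
        method = getattr(self, kind, None)
        if kind.startswith("_") or not callable(method):
            raise ValueError(f"unknown plot kind: {kind!r}")
        return method(**kwargs)

    def line(self, **kwargs):
        # A real accessor would build a chart here; we just echo the request.
        return ("line", kwargs)

    def scatter(self, **kwargs):
        return ("scatter", kwargs)


plot = ToyPlotMethods(data=None)
via_kind = plot(kind="scatter", x="sepalLength", y="petalLength")
via_method = plot.scatter(x="sepalLength", y="petalLength")
```

Because each plot type is an ordinary method in this pattern, it shows up in tab completion and can carry its own docstring.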
The vgplot interface exposes nine basic plot types; we will show examples of these below.
Datasets
For the examples on this page, we will use a number of datasets made available in the vega_datasets package:
iris = data.iris()
stocks = data.stocks(pivoted=True)
cars = data.cars()
These datasets are stored in the form of Pandas DataFrames:
>>> iris.head()
   petalLength  petalWidth  sepalLength  sepalWidth species
0          1.4         0.2          5.1         3.5  setosa
1          1.4         0.2          4.9         3.0  setosa
2          1.3         0.2          4.7         3.2  setosa
3          1.5         0.2          4.6         3.1  setosa
4          1.4         0.2          5.0         3.6  setosa
>>> stocks.head()
symbol       AAPL   AMZN  GOOG     IBM   MSFT
date
2000-01-01  25.94  64.56   NaN  100.52  39.81
2000-02-01  28.66  68.87   NaN   92.11  36.35
2000-03-01  33.95  67.00   NaN  106.11  43.22
2000-04-01  31.01  55.19   NaN   99.95  28.37
2000-05-01  21.00  48.31   NaN   96.31  25.45
>>> cars.head()
   Acceleration  Cylinders  Displacement  Horsepower  Miles_per_Gallon  \
0          12.0          8         307.0       130.0              18.0
1          11.5          8         350.0       165.0              15.0
2          11.0          8         318.0       150.0              18.0
3          12.0          8         304.0       150.0              16.0
4          10.5          8         302.0       140.0              17.0

                        Name Origin  Weight_in_lbs       Year
0  chevrolet chevelle malibu    USA           3504 1970-01-01
1          buick skylark 320    USA           3693 1970-01-01
2         plymouth satellite    USA           3436 1970-01-01
3              amc rebel sst    USA           3433 1970-01-01
4                ford torino    USA           3449 1970-01-01
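The pivoted=True argument gives the stocks data in wide form, with dates as the index and one column per symbol. As a rough illustration of what that reshaping means, the sketch below pivots a tiny hand-built long-form table with Pandas. The long-form column names 'symbol', 'date' and 'price' are assumptions made for this example; the prices are the first two AAPL and AMZN rows shown above.

```python
import pandas as pd

# Long-form records: one row per (symbol, date) pair.
# The column names 'symbol', 'date' and 'price' are assumed for illustration.
long_form = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "AMZN", "AMZN"],
    "date": pd.to_datetime(["2000-01-01", "2000-02-01"] * 2),
    "price": [25.94, 28.66, 64.56, 68.87],
})

# Wide form: dates become the index and each symbol becomes a column.
wide_form = long_form.pivot(index="date", columns="symbol", values="price")
```

The wide layout is convenient for vgplot, since each column can then be drawn as its own series.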
Line Plots with vgplot.line
The default plot type for vgplot is a line plot:
stocks.vgplot()
Unless otherwise specified, the index of the DataFrame or Series is used as the x-axis variable, and a separate line is drawn for the y-values in each column of the DataFrame. If you’d like to plot a subset of the columns, you can use Pandas indexing to select the columns you are interested in:
stocks[['AAPL', 'AMZN']].vgplot.line()
Optionally, you can specify the column names to use for the x-axis and y-axis:
stocks.vgplot.line(x='AAPL', y='AMZN')
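One way to picture "the index as x, one line per column" is as a reshaping of the wide table into one row per plotted point. The sketch below does this with Pandas alone. It is an illustration of the idea, not a description of pdvega's internals, and the names 'variable' and 'value' are arbitrary choices for this example.

```python
import pandas as pd

wide = pd.DataFrame(
    {"AAPL": [25.94, 28.66], "AMZN": [64.56, 68.87]},
    index=pd.to_datetime(["2000-01-01", "2000-02-01"]),
)
wide.index.name = "date"

# Move the index into a column (the x values), then stack the remaining
# columns into (variable, value) pairs: one row per point on each line.
long_form = wide.reset_index().melt(
    id_vars="date", var_name="variable", value_name="value"
)
```

Each distinct entry in the variable column then corresponds to one line, with date on the x-axis and value on the y-axis.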
Line plots can be further customized; see the function documentation for more information:
- Series line plot: pdvega.SeriesPlotMethods.line()
- DataFrame line plot: pdvega.FramePlotMethods.line()
Scatter Plots with vgplot.scatter
The previous plot might make more sense in the form of a scatter plot. This can be done with vgplot.scatter():
stocks.vgplot.scatter(x='AAPL', y='AMZN')
You can also encode the color and size of scatter plots; let’s switch to the cars dataset to see the relationship between some of these variables:
cars.vgplot.scatter(x='Horsepower', y='Miles_per_Gallon',
                    c='Origin', s='Weight_in_lbs', alpha=0.5)
This is one slight difference from the Pandas plot interface: in Pandas the c and s parameters must be passed as arrays, while here we pass them as column names.
Scatter plots can be further customized; see pdvega.FramePlotMethods.scatter() for more information.
Area Plots with vgplot.area
Area plots are quite similar to line plots, but curves are filled and stacked, meaning the top curve reflects the sum of all the ones below:
stocks[['MSFT', 'AAPL', 'AMZN']].vgplot.area()
Area charts can also be unstacked and overlaid, in which case transparency can be useful:
stocks[['MSFT', 'AAPL', 'AMZN']].vgplot.area(stacked=False, alpha=0.4)
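To make the meaning of stacking concrete, the sketch below computes the lower and upper boundary of each band by hand with Pandas, using the first three rows of the stocks table above. It only illustrates the arithmetic behind a stacked chart; how pdvega or Vega-Lite performs the stacking internally is not shown here.

```python
import pandas as pd

prices = pd.DataFrame({
    "MSFT": [39.81, 36.35, 43.22],
    "AAPL": [25.94, 28.66, 33.95],
    "AMZN": [64.56, 68.87, 67.00],
})

# Upper boundary of each stacked band: running total across the columns.
upper = prices.cumsum(axis=1)
# Lower boundary: the upper boundary of the band beneath it (zero for the first).
lower = upper.shift(1, axis=1).fillna(0.0)
```

The last column of upper is the sum of all three series, which is why the top curve of a stacked area chart traces the total. With stacked=False every band instead starts at zero, so the curves overlap.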
Area plots can be further customized; see the function documentation for more information:
- Series area plot: pdvega.SeriesPlotMethods.area()
- DataFrame area plot: pdvega.FramePlotMethods.area()
Bar Charts with vgplot.bar
Bar charts are supported using vgplot.bar(). Let’s create a small dataset to use for this:
import numpy as np
import pandas as pd
np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
df.vgplot.bar()
By default, the bars for multiple columns are layered on top of each other; as with area charts, they can be stacked instead using the stacked=True option:
df.vgplot.bar(stacked=True)
Additionally, horizontal bar plots can be created with barh:
df.vgplot.barh(stacked=True)
Bar charts can be further customized; see the function documentation for more information:
- Series bar plots: pdvega.SeriesPlotMethods.bar(), pdvega.SeriesPlotMethods.barh()
- DataFrame bar plots: pdvega.FramePlotMethods.bar(), pdvega.FramePlotMethods.barh()
Histograms with vgplot.hist
Histograms can be created with the vgplot.hist() method. Let’s create some data to make some distributions:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randn(1000) + 1,
                   'b': np.random.randn(1000),
                   'c': np.random.randn(1000) - 1},
                  columns=['a', 'b', 'c'])
We’ll specify 50 bins and create a layered histogram with 50% transparency:
df.vgplot.hist(bins=50, alpha=0.5)
Alternatively, we can stack the histogram, and use histtype to specify that we want a filled step chart rather than a bar chart:
df.vgplot.hist(histtype='stepfilled', stacked=True, bins=50)
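For readers who want to see what the bins argument controls, here is a small sketch of equal-width binning with NumPy. It is an independent illustration of the binning arithmetic; whether pdvega computes the bins itself or leaves that to Vega-Lite is not covered here, and the seed and sample are arbitrary.

```python
import numpy as np

# Arbitrary sample for illustration: 1000 draws from a standard normal.
rng = np.random.RandomState(1234)
values = rng.randn(1000)

# Split the data range into 50 equal-width bins and count the points in each.
counts, edges = np.histogram(values, bins=50)

# Every bin has the same width: (max - min) / number of bins.
bin_width = edges[1] - edges[0]
```

Fifty bins are described by 51 edges, and the counts always add up to the number of points, since the bins cover the full data range.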
Histograms can be further customized; see the function documentation for more information:
- Series histogram: pdvega.SeriesPlotMethods.hist()
- DataFrame histogram: pdvega.FramePlotMethods.hist()
KDE/Density plots with vgplot.kde
A kernel density estimation (KDE) plot is similar to a histogram, but draws a smooth curve representing the density of points. It can be created with the vgplot.kde method. We’ll use the same data as in the histogram section:
df.vgplot.kde()
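The smooth curve in a KDE plot can be understood as an average of small Gaussian bumps, one centred on each data point. The sketch below implements that idea directly in NumPy. The function gaussian_kde_1d and the bandwidth of 0.4 are choices made for this example and are not taken from pdvega, which may select the bandwidth differently.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


rng = np.random.RandomState(0)
samples = rng.randn(500)
grid = np.linspace(-5, 5, 201)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# Approximate the area under the curve with a Riemann sum on the grid.
area = density.sum() * (grid[1] - grid[0])
```

The area under the estimated curve is close to one, as expected for a density. A larger bandwidth gives a smoother curve, while a smaller one follows the individual points more closely.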
KDE plots can be further customized; see the function documentation for more information:
- Series kde plots: pdvega.SeriesPlotMethods.kde()
- DataFrame kde plots: pdvega.FramePlotMethods.kde()
Pie Charts
No.
Heatmaps
Pandas plotting has a function to create a hexagonally-binned heatmap of two-dimensional data. Unfortunately, neither Vega nor Vega-Lite currently supports hexagonal binning. They do, however, support Cartesian heatmaps, and this functionality is included in pdvega:
df.vgplot.heatmap(x='a', y='b', gridsize=20)
Here the gridsize parameter indicates approximately how many grid points span the plot. Alternatively, instead of computing the count within each bin, we can compute the mean of a third column, specified by the C parameter:
df.vgplot.heatmap(x='a', y='b', C='c', gridsize=20)
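The difference between the two heatmaps, a count per cell versus the mean of a third column per cell, can be sketched with NumPy's two-dimensional histogram. This is an illustration under simplifying assumptions: it uses exactly 20 equal-width bins per axis, whereas gridsize in pdvega is only approximate, and the random data and seed are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(1234)
a, b, c = rng.randn(3, 1000)

gridsize = 20
# Number of points falling in each cell of a 20 x 20 grid.
counts, a_edges, b_edges = np.histogram2d(a, b, bins=gridsize)
# Sum of c in each cell, reusing the same cell edges.
c_sums, _, _ = np.histogram2d(a, b, bins=[a_edges, b_edges], weights=c)

# Mean of c per cell; cells that contain no points are left as NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    c_means = np.where(counts > 0, c_sums / counts, np.nan)
```

Here counts corresponds to the first heatmap and c_means to the second; empty cells have no defined mean, so they are marked as missing rather than zero.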
Heatmap plots can be further customized; see pdvega.FramePlotMethods.heatmap() for more information.
Other Plot Types
The above plots are the basic plot types supported by pdvega; more sophisticated plots are available in the pdvega.plotting module. For examples of these, refer to Statistical Visualization with pdvega.plotting.