Statistical Visualization with pdvega.plotting

In addition to the basic plots provided by the vgplot interface, pdvega.plotting offers several more sophisticated plot types that mirror those in pandas.plotting.

This section will outline a few of these.

Scatter Matrix

For multi-dimensional data, a single scatter plot rarely captures all the relevant features. For data with several attributes, it can be useful to visualize the relationships between all pairs of dimensions. This is done by pdvega.scatter_matrix(), which has an API based on pandas.plotting.scatter_matrix():

pdvega.scatter_matrix(iris, "species", figsize=(7, 7))

Notice that this version is interactive in two ways. If you click and drag on any frame of the plot, the scales of all frames are adjusted dynamically in concert. Further, holding the SHIFT key while clicking and dragging enables a linked-brushing operation that lets you track points between panels.
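To make the grid layout concrete, the following is a minimal sketch of how a scatter matrix assigns one panel to each ordered pair of dimensions. It is not pdvega's implementation; the helper name and the column names are hypothetical and chosen only for illustration.

```python
def scatter_matrix_panels(columns):
    """Return, for each grid cell, the (x dimension, y dimension) it plots."""
    # Row i, column j shows columns[j] on the x axis and columns[i] on the y axis.
    return [[(x_dim, y_dim) for x_dim in columns] for y_dim in columns]

# Hypothetical column names for a data set with four numeric features.
dims = ["sepalLength", "sepalWidth", "petalLength", "petalWidth"]
grid = scatter_matrix_panels(dims)

for row in grid:
    print(row)
```

With four features the grid has 16 panels, which is why linked scales and brushing help when comparing them.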

Parallel Coordinates

Another way to visualize multi-dimensional data is to look at each dimension independently, using a parallel coordinates plot. This can be done using pdvega.parallel_coordinates(), which follows the API of pandas.plotting.parallel_coordinates():

pdvega.parallel_coordinates(iris, "species")

In one glance, this lets you see relationships between points, and in particular makes clear that the “setosa” species is well-separated from the other two in the dimensions of petal width and length.
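As a rough illustration of what a parallel coordinates plot draws, the sketch below turns one record into the vertices of a polyline, with one vertex per dimension. The helper and the sample record are assumptions made for this example and are not part of pdvega.

```python
def to_polyline(record, dims):
    """Turn one record into (axis position, value) vertices, one per dimension."""
    return [(axis, record[dim]) for axis, dim in enumerate(dims)]

# Hypothetical column names and an illustrative record.
dims = ["sepalLength", "sepalWidth", "petalLength", "petalWidth"]
flower = {"sepalLength": 5.1, "sepalWidth": 3.5,
          "petalLength": 1.4, "petalWidth": 0.2, "species": "setosa"}

polyline = to_polyline(flower, dims)
print(polyline)
```

Each record yields one such line, colored by its class, so a class that is separated along some dimension shows up as a bundle of lines that passes through a distinct range on that axis.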

Andrews Curves

A similar approach to visualizing data dimensions is known as Andrews curves: the idea is to construct a Fourier series from the features of each object, in order to qualitatively visualize the aggregate differences between classes. This can be done with the pdvega.andrews_curves() function, which follows the API of pandas.plotting.andrews_curves():

pdvega.andrews_curves(iris, "species")

This gives us a similar impression to what we saw in the parallel coordinates plot – that setosa is somehow distinct from the other species – but gives less quantitative insight into just which features lead to that distinction.
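For reference, the Fourier series behind an Andrews curve is usually written as f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -π and π. The sketch below evaluates that standard form for a single feature vector; it is an illustration under that assumption, not pdvega's own code, and the sample values are illustrative.

```python
import math

def andrews_curve(features, t):
    """Evaluate the Andrews curve of one feature vector at angle t."""
    total = features[0] / math.sqrt(2)
    for i, x in enumerate(features[1:], start=1):
        harmonic = (i + 1) // 2  # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += x * math.sin(harmonic * t)  # odd positions: sine terms
        else:
            total += x * math.cos(harmonic * t)  # even positions: cosine terms
    return total

# Illustrative feature vector with four measurements.
features = [5.1, 3.5, 1.4, 0.2]

# Sample the curve on an evenly spaced grid over [-pi, pi].
ts = [-math.pi + 2 * math.pi * k / 100 for k in range(101)]
curve = [andrews_curve(features, t) for t in ts]
```

Because every feature is mixed into every point of the curve, differences between classes are visible in aggregate, but individual features are hard to read off.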

Lag Plot

Finally, for time series, an interesting type of plot is known as a lag plot. This is implemented by the pdvega.plotting.lag_plot() function, which follows the API of pandas.plotting.lag_plot().
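Conceptually, a lag plot is a scatter plot of each value y(t) against the value y(t + lag) observed a fixed number of steps later. A minimal sketch of that pairing, using a hypothetical helper and made-up numbers rather than pdvega's internals:

```python
def lag_pairs(values, lag=1):
    """Pair each value y[t] with the value y[t + lag] observed `lag` steps later."""
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return list(zip(values[:-lag], values[lag:]))

# Made-up series for illustration.
series = [1, 2, 3, 4, 5]
pairs = lag_pairs(series, lag=2)
print(pairs)
```

A series of n values yields n - lag points, so long lags on short series leave little to plot.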

Here we’ll visualize the stock prices of Amazon and Microsoft from 1998 to 2010, using a lag of 12 months:

pdvega.lag_plot(stocks[['AMZN', 'MSFT']], lag=12)

It’s immediately apparent from this plot that Amazon was far more volatile during that period: its price at any given time showed very little correlation with its price a year later. By contrast, Microsoft’s price was much more stable over the same period.

The simple time-series plot of each company’s stock price supports the same interpretation:

stocks[['AMZN', 'MSFT']].vgplot.line()
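The visual impression of "little correlation" in a lag plot can also be expressed as a number: the Pearson correlation between a series and a copy of itself shifted by the lag. The sketch below uses only the standard library and two synthetic series, not the stock data, so the function name and inputs are assumptions for illustration.

```python
import math

def lagged_correlation(values, lag):
    """Pearson correlation between y[t] and y[t + lag]."""
    x, y = values[:-lag], values[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Synthetic examples: a steady trend, and a cycle whose period is four times the lag.
trend = [10 + 0.5 * i for i in range(60)]
cycle = [math.sin(2 * math.pi * i / 48) for i in range(60)]

r_trend = lagged_correlation(trend, 12)  # points fall on a straight line
r_cycle = lagged_correlation(cycle, 12)  # no linear relationship at this lag
```

A value near 1 corresponds to lag-plot points that hug a straight line, while a value near 0 means the earlier value says little, in a linear sense, about the later one.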