3 - Graphical Visualization of a Dataset

Visualizing a dataset is a fundamental task in descriptive statistics because it provides a first, qualitative picture of how the data are distributed. In the following, we illustrate how to draw a histogram, a cumulative distribution plot, and a box-and-whisker plot using the matplotlib library. Finally, we show some advanced visualization techniques provided by the seaborn library.

As in the previous example, we will use the dataset from Smith et al. (2011):

In [2]:
import pandas as pd

# Note: the keyword is sheet_name (the older sheetname spelling was removed from pandas)
myDataset = pd.read_excel('Smith_glass_post_NYT_data.xlsx', sheet_name='Supp_traces')

Histogram diagrams of a univariate distribution

As reported in the official documentation, the command matplotlib.pyplot.hist computes and draws the histogram of a dataset. As an example, the histogram of the absolute frequencies of Zr can be generated as follows:

In [6]:
import matplotlib.pyplot as plt

plt.figure()
x = myDataset.Zr
plt.hist(x, bins = "auto") 
plt.xlabel('Zr [ppm]')
plt.ylabel('Counts')
plt.show()

For further details about the bins parameter, please refer to the official documentation.

Similarly, the histogram can be normalized so that the area under it equals 1, as follows:
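
As a rough illustration of what the bins argument controls: matplotlib passes a string such as "auto" on to numpy, whose histogram_bin_edges function chooses the bin edges. The sketch below compares a few of the named rules on synthetic, made-up values (they only stand in for the Zr column and are not the Smith et al. data).

```python
import numpy as np

# Synthetic, made-up values standing in for the Zr column
rng = np.random.default_rng(0)
values = rng.normal(loc=300.0, scale=50.0, size=500)

# Bin edges chosen by three of the rules that numpy knows by name
edges_auto = np.histogram_bin_edges(values, bins="auto")
edges_sturges = np.histogram_bin_edges(values, bins="sturges")
edges_fd = np.histogram_bin_edges(values, bins="fd")

# The number of bins is always one less than the number of edges
for name, edges in [("auto", edges_auto),
                    ("sturges", edges_sturges),
                    ("fd", edges_fd)]:
    print(name, len(edges) - 1, "bins")
```

An integer (for example bins=20) or an explicit sequence of edges can be passed instead of a rule name when a fixed binning is preferred.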

In [11]:
plt.figure()
plt.hist(x, bins= "auto", density = True) 
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability density')
plt.show()

As reported in the official documentation, the density = True option normalizes the histogram to form a probability density, i.e., the area (or integral) under the histogram equals 1. This is achieved by dividing each count by the number of observations times the bin width, rather than by the total number of observations alone.
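
The normalization can be checked numerically. The following is a minimal sketch, assuming synthetic, made-up values in place of the Zr column; it reproduces the density by hand and compares it with the result of numpy's density=True option, on which matplotlib relies.

```python
import numpy as np

# Synthetic, made-up values standing in for the Zr column
rng = np.random.default_rng(0)
values = rng.normal(loc=300.0, scale=50.0, size=500)

counts, edges = np.histogram(values, bins="auto")
widths = np.diff(edges)

# Density by hand: count / (number of observations * bin width)
density = counts / (counts.sum() * widths)

# The same quantity as computed by the density=True option
density_np, _ = np.histogram(values, bins=edges, density=True)

# Area under the density histogram (sum of bar height * bar width)
area = np.sum(density * widths)

# For comparison: plain relative frequencies, whose heights sum to 1
relative_freq = counts / counts.sum()
print(area, relative_freq.sum())
```

Note the difference between the two normalizations: with the density, the bar areas sum to 1, whereas with relative frequencies the bar heights sum to 1.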

Plot a cumulative distribution

The cumulative distribution function (CDF, sometimes loosely called cumulative density function) of a distribution, evaluated at a value x, is the probability of obtaining a value less than or equal to x. Using matplotlib, it can be plotted as follows:

In [25]:
plt.figure()
plt.hist(x, bins='auto', density=True, histtype='step', cumulative=1)
plt.xlabel('Zr [ppm]')
plt.ylabel('Likelihood of occurrence')
plt.show()
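
The cumulative histogram above depends on the chosen bins. As an alternative sketch (not the method used in the cell above), the empirical CDF can be evaluated directly from the definition by counting the observations less than or equal to x. The helper name empirical_cdf and the sample values are hypothetical and used only for illustration.

```python
import numpy as np

def empirical_cdf(values, x):
    """Fraction of observations in `values` that are less than or equal to x."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    # side="right" counts every observation <= x, including ties
    return np.searchsorted(sorted_values, x, side="right") / sorted_values.size

# Made-up sample (hypothetical concentrations in ppm)
sample = [120.0, 95.0, 150.0, 95.0, 200.0]

p_at_95 = empirical_cdf(sample, 95.0)    # two of five observations are <= 95
p_at_200 = empirical_cdf(sample, 200.0)  # all observations are <= 200
p_at_50 = empirical_cdf(sample, 50.0)    # no observation is <= 50
print(p_at_95, p_at_200, p_at_50)
```

Because the function also accepts an array for x, evaluating it at the sorted observations and drawing the result with plt.step gives a bin-free version of the cumulative plot.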

Box Plots

In descriptive statistics, a box plot (or boxplot) is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. Box-and-whisker plots are uniform in their use of the box: the bottom and top of the box are always the first and third quartiles, and the band inside the box is always the second quartile (the median). By default in matplotlib, the ends of the whiskers represent the lowest datum still within 1.5 times the interquartile range of the lower quartile, and the highest datum still within 1.5 times the interquartile range of the upper quartile. Any datum not included between the whiskers is plotted as an outlier with a single symbol. Using the matplotlib library, a boxplot can be defined as follows:

In [5]:
plt.figure()
plt.boxplot(x)
plt.ylabel('Zr [ppm]')
plt.xticks([1],['all Epochs'])
plt.show()
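
The seaborn library can draw one box per group. In the following example, the Zr values are grouped by the Epoch column, and whis=np.inf extends the whiskers to the minimum and maximum values, so that no point is drawn as an outlier:

In [16]: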
import seaborn as sns
import numpy as np
plt.figure()
sns.boxplot(x="Epoch", y="Zr", data=myDataset, whis=np.inf, palette="vlag")
plt.show()
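
The default whisker rule described above for matplotlib (whis=1.5) can be reproduced by hand. The following is a minimal sketch on made-up values, assuming linearly interpolated percentiles; the helper name box_stats is hypothetical, and the code does not cover every edge case handled by matplotlib.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers following the 1.5 * IQR rule."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1  # interquartile range

    # Whiskers end at the most extreme data still inside the fences
    lower_whisker = data[data >= q1 - whis * iqr].min()
    upper_whisker = data[data <= q3 + whis * iqr].max()

    # Everything beyond the whiskers is drawn as an individual point
    outliers = data[(data < lower_whisker) | (data > upper_whisker)]
    return {"q1": q1, "median": median, "q3": q3,
            "lower_whisker": lower_whisker,
            "upper_whisker": upper_whisker,
            "outliers": outliers}

# Made-up sample with one value far from the rest
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_stats(sample)
print(stats)
```

In this sample the upper fence is 7.75 + 1.5 * 4.5 = 14.5, so the upper whisker stops at 9 and the value 30 is reported as an outlier.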

Seaborn Pairplots

As reported in the official documentation of the seaborn library, the pairplot function plots pairwise relationships in a dataset. By default, this function creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal Axes are treated differently: they show the univariate distribution of the variable in that column.

In [9]:
import seaborn as sns

myDataset1 = myDataset[['Ba','Zr','Th']]

# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(myDataset1)
plt.show()

Please refer to the official seaborn documentation for more captivating examples.

In [15]:
import seaborn as sns
sns.set(style="ticks")

x = myDataset.Ba
y = myDataset.Zr

# jointplot creates its own figure; recent seaborn versions require x and y as keywords
sns.jointplot(x=x, y=y, kind="hex", color="#4CB391")
plt.colorbar()
plt.show()