2 - Descriptive Statistics

Statistic is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions (Ross, 2010). The part of statistics concerned with the description and summarization of data is called descriptive statistics (Ross, 2010).

Basic Descriptive Statistics in python

In the following, we will learn how to perform basic statistical operations in Python, useful for geological applications. In detail, we will define some descriptive indices and parameter to define: 1) the cental tendency (i.e., one or more indices defining the position of the studied dataset), 2) the dispersion (i.e., its variability), and 3) the shape of a dataset (e.g., its shape).

As in the previous example, we will use the dataset from Smith et al. (2011):

In [3]:
import pandas as pd

myDataset = pd.read_excel('Smith_glass_post_NYT_data.xlsx', sheet_name='Supp_traces')

Central tendency of a dataset

In descriptive statistics, it is useful to represent an entire dataset with a single value describing the "middle" or "average" value. That single value is defined as the central tendency. The mean, the median, and the mode are all ways to describe it.

Means

The arithmetic mean $ \mu _{A} $ is the average of all numbers defined as:

\begin{equation} {\mu_A={\frac {1}{n}}\sum _{i=1}^{n}z_{i}={\frac {z_{1}+z_{2}+\cdots +z_{n}}{n}}} \label{eq:ArithmeticMean}\end{equation}

The geometric mean $ \mu _{G} $ is a type of mean, which indicates the central tendency of a dataset by using the product of their values:

\begin{equation} {\mu_G= ( z_{1} z_{2} \cdots z_{n} ) ^\frac{1} {n}} \label{eq:GeometricMean}\end{equation}

Finally, the harmonic mean $ \mu _{H} $ can be expressed as follow:

\begin{equation} \mu _H = \frac{n}{\frac{1}{z_{1}} + \frac{1}{z_{2}} + \cdots + \frac{1}{z_{n}}} \label{eq:HarmonicMean}\end{equation}

In the following, one of the ways (remember, there are also many ways to get the solution of a problem in Python) for getting the different means for a specific feature (in our case a chemical element like Zirconium, Zr) in the imported dataset is reported:

In [6]:
from scipy.stats.mstats import gmean, hmean
import matplotlib.pyplot as plt

print ('-------')

print ('arithmetic mean')
a_mean = myDataset.Zr.mean()
print ("{0:.1f}".format(a_mean))

print ('-------')

print ('geometric mean')
g_mean = gmean(myDataset['Zr'])
print ("{0:.1f}".format(g_mean))

print ('-------')

print ('harmonic mean')
h_mean = hmean(myDataset['Zr'])
print ("{0:.1f}".format(h_mean))

print ('-------')

plt.figure()
plt.hist(myDataset.Zr, bins= "auto", density = True, label="Measurements Hist") 
plt.axvline(a_mean, color="red", label="Arithmetic mean")
plt.axvline(g_mean, color="black",  label="Geometric mean")
plt.axvline(h_mean, color="green",  label="Harmonic mean")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()
-------
arithmetic mean
365.4
-------
geometric mean
348.6
-------
harmonic mean
333.8
-------
<Figure size 640x480 with 1 Axes>

the operator "{0:.1f}".format()</a> simply reduces a number to a string and reduces decimals to 1. We will discuss the meaning of significative decimals later in Capter XX, in the following, we will arbitrary approximate all the resuts to one significative decimal.

Some more exapmles:

In [29]:
a_mean = myDataset.Zr.mean()

print("{0:.0f}".format(a_mean))
print("{0:.1f}".format(a_mean))
print("{0:.2f}".format(a_mean))
print("{0:.4f}".format(a_mean))
365
365.4
365.38
365.3774

Median

The median, $Me$, is the number at the middle of a dataset after a sorting from the lower to the higher value. As a consequence, to get the median value of a dataset, you have to order the data values from smallest to largest. If the number of data values is odd, then the sample median is the middle value in the ordered list; if it is even, then the sample median is the average of the two middle values (Ross, 2010).

In [6]:
print ('-------')
print ('median')
median = myDataset.Zr.median()
print("{0:.1f}".format(median))
print ('-------')

plt.figure()
plt.hist(myDataset.Zr, bins= "auto", density = True, label="Measurements Hist") 
plt.axvline(median, color="red", label="Median")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()
-------
median
339.4
-------

Mode

The mode, $Mo$, of a dataset is the value that appears most frequently in the data set (Ross, 2010). In python, it is possible to retreive the modal value using the following comand:

In [7]:
from scipy.stats.mstats import mode
print ('modal value')
modal_value = mode(myDataset['Zr'].astype(int))[0]
print (modal_value[0])

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(modal_value[0], color="red", label="Modal value")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()
modal value
253.0

Dispersion of a dataset

We have just introduced several estimators of the central tendency of a dataset. However we have not yet considered any measure of its variability. The range, the variance and the standard deviation are all estimators of the dispersion (i.e., variability) of a dataset.

Range

A first grossy estimator of the variability of a dataset is provided by the range. The range, $ R $, is the difference between the highest and lowest values in the the data set:

\begin{equation} {R= ( z_{max} - z_{min} ) } \label{eq:Range}\end{equation}

In Pandas, we can easily estimate the range as follow:

In [32]:
print ('-------')

print ('Range')
r = myDataset['Zr'].max()- myDataset['Zr'].min()
print("{0:.1f}".format(r))

print ('-------')

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(myDataset['Zr'].max(), color="red", label="Max value")
plt.axvline(myDataset['Zr'].min(), color="red", label="Min value")
plt.axvspan(myDataset['Zr'].min(), myDataset['Zr'].max(), alpha=0.1, color='red', label="Range")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()
-------
Range
734.8
-------

Variance and Standard Deviation

The variance, $ \sigma^2 $, of a dataset is defined as the averaged squared deviations from the sample mean:

\begin{equation} {\sigma^2 = \frac{\sum_{i=1}^{n}(z_i - \mu)^2} {n}} \label{eq:Variance}\end{equation}

The standard deviation, $ \sigma $, is the square root of thr variance:

\begin{equation} {\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n}(z_i - \mu)^2} {n}}} \label{eq:StandardDeviation}\end{equation}

The variance and the standard deviation of a dataset are easily derived in Pandas using the following commands:

In [35]:
print ('-------')
print ('Variance')
variance =  myDataset['Zr'].var()
print("{0:.1f}".format(variance))
print ('-------')
print ('Standard Deviation')
stddev =  myDataset['Zr'].std()
print("{0:.1f}".format(stddev))

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(myDataset['Zr'].mean()-stddev, color="red", label="Max value")
plt.axvline(myDataset['Zr'].mean()+stddev, color="red", label="Min value")
plt.axvspan(myDataset['Zr'].mean()-stddev, myDataset['Zr'].mean()+stddev, alpha=0.1, color='red', label="mean +/- 1-sigma")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()
-------
Variance
14020.9
-------
Standard Deviation
118.4

Shape of a dataset

After the introduction of some parameters providing information about the central tendency and the variability of a data set, let's start analyzing an index giving some constraints about of the shape of a distribution: the skewness.

Skewness

The skewness is a statistical parameter providing information about the symmetry in a distribution. In the case of a symmetric distribution, the arithmetic mean, the median and the mode express the same value. As a consequence, $Mo = Me = \mu_A$. Please note the coincidence of these three values, altought being a necessary condition for symmetric distributions, does not guarantee the symmetry of a distribution. On the contrary, the non-coincidence of these three parameters point to a skewed distribution. In particular, in the cases where $Mo < Me < \mu_A$ and $\mu_A < Me < Mo$ the distribution is characterized by a tail on the right and left side respectively.

In the specific case of Zr distribution of our dataset, where $Mo < Me < \mu_A$ dysplayng a tail on the right side, as expected.

In [37]:
plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(modal_value[0], color="red", label="Modal Value")
plt.axvline(median, color="black",  label="Median Value")
plt.axvline(a_mean, color="green",  label="Arithmetic mean")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

A parameter providing us information about the skewness of a distribution is the Pearson’s first coefficient of skewness:

\begin{equation} \alpha _1 =\frac{3 \left ( \mu -Mo \right )}{\sigma } \label{eq:Pearson1Skew}\end{equation}

A second parameter providing us information about the skewness of a distribution is the Pearson’s second moment of skewness:

\begin{equation} \alpha _2 =\frac{3 \left ( \mu -Me \right )}{\sigma } \label{eq:Pearson2Skew}\end{equation}

An addittional parameter providing us information about the skewness of a distribution is the Fisher-Pearson's coefficient of skewness:

\begin{equation} {g_1 =\frac{\frac{1}{n}\sum_{i=1}^{n} \left ({ z_i - \mu}\right )^3} {\sigma^3} } \label{eq:FisherIndexSkew}\end{equation}

In pyton, the $\alpha _1$, $\alpha _2$, and $g_1$ parameters can be determinedas follow:

In [8]:
from scipy.stats import skew

print ("Pearson’s first coefficient of skewness")
mean = myDataset['Zr'].mean()
median = myDataset['Zr'].median()
standard_deviation = myDataset['Zr'].median()
a1=3*(mean-median)/standard_deviation
print("{0:.1f}".format(a1))
print ('-------')
print ('Pearson’s second moment of skewness')
modal_value = mode(myDataset['Zr'].astype(int))[0]
a2=3*(mean-modal_value[0])/standard_deviation
print("{0:.1f}".format(a2))
print ('-------')
print ('Fisher-Pearson’s coefficient of skewness')
g1 = skew(myDataset['Zr'])
print("{0:.1f}".format(g1))
print ('-------')
Pearson’s first coefficient of skewness
0.2
-------
Pearson’s second moment of skewness
1.0
-------
Fisher-Pearson’s coefficient of skewness
1.3
-------

Descriptive statistics in Pandas

As reported in the official documentation of Pandas, the command describe() "generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."

In [62]:
myDataset[['Ba','Sr','Zr','La']].describe()
Out[62]:
Ba Sr Zr La
count 370.000000 369.000000 370.000000 370.000000
mean 789.733259 516.422115 365.377397 74.861088
std 523.974960 241.922439 118.409962 18.256772
min 0.000000 9.541958 185.416567 45.323289
25% 297.402777 319.667551 274.660242 61.745228
50% 768.562055 490.111131 339.412064 71.642167
75% 1278.422645 728.726116 438.847368 83.670805
max 2028.221963 1056.132069 920.174406 169.550008