2 - Descriptive Statistics¶

Statistic is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions (Ross, 2010). The part of statistics concerned with the description and summarization of data is called descriptive statistics (Ross, 2010).

Basic Descriptive Statistics in python¶

In the following, we will learn how to perform basic statistical operations in Python, useful for geological applications. In detail, we will define some descriptive indices and parameter to define: 1) the cental tendency (i.e., one or more indices defining the position of the studied dataset), 2) the dispersion (i.e., its variability), and 3) the shape of a dataset (e.g., its shape).

As in the previous example, we will use the dataset from Smith et al. (2011):

import pandas as pd

myDataset = pd.read_excel('Smith_glass_post_NYT_data.xlsx', sheet_name='Supp_traces')

Central tendency of a dataset¶

In descriptive statistics, it is useful to represent an entire dataset with a single value describing the "middle" or "average" value. That single value is defined as the central tendency. The mean, the median, and the mode are all ways to describe it.

Means¶

The arithmetic mean $ \mu _{A} $ is the average of all numbers defined as:

\begin{equation} {\mu_A={\frac {1}{n}}\sum _{i=1}^{n}z_{i}={\frac {z_{1}+z_{2}+\cdots +z_{n}}{n}}} \label{eq:ArithmeticMean}\end{equation}

The geometric mean $ \mu _{G} $ is a type of mean, which indicates the central tendency of a dataset by using the product of their values:

\begin{equation} {\mu_G= ( z_{1} z_{2} \cdots z_{n} ) ^\frac{1} {n}} \label{eq:GeometricMean}\end{equation}

Finally, the harmonic mean $ \mu _{H} $ can be expressed as follow:

\begin{equation} \mu _H = \frac{n}{\frac{1}{z_{1}} + \frac{1}{z_{2}} + \cdots + \frac{1}{z_{n}}} \label{eq:HarmonicMean}\end{equation}

In the following, one of the ways (remember, there are also many ways to get the solution of a problem in Python) for getting the different means for a specific feature (in our case a chemical element like Zirconium, Zr) in the imported dataset is reported:

from scipy.stats.mstats import gmean, hmean
import matplotlib.pyplot as plt

print ('-------')

print ('arithmetic mean')
a_mean = myDataset.Zr.mean()
print ("{0:.1f}".format(a_mean))

print ('-------')

print ('geometric mean')
g_mean = gmean(myDataset['Zr'])
print ("{0:.1f}".format(g_mean))

print ('-------')

print ('harmonic mean')
h_mean = hmean(myDataset['Zr'])
print ("{0:.1f}".format(h_mean))

print ('-------')

plt.figure()
plt.hist(myDataset.Zr, bins= "auto", density = True, label="Measurements Hist") 
plt.axvline(a_mean, color="red", label="Arithmetic mean")
plt.axvline(g_mean, color="black",  label="Geometric mean")
plt.axvline(h_mean, color="green",  label="Harmonic mean")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

-------
arithmetic mean
365.4
-------
geometric mean
348.6
-------
harmonic mean
333.8
-------

<Figure size 640x480 with 1 Axes>

the operator "{0:.1f}".format()</a> simply reduces a number to a string and reduces decimals to 1. We will discuss the meaning of significative decimals later in Capter XX, in the following, we will arbitrary approximate all the resuts to one significative decimal.

Some more exapmles:

a_mean = myDataset.Zr.mean()

print("{0:.0f}".format(a_mean))
print("{0:.1f}".format(a_mean))
print("{0:.2f}".format(a_mean))
print("{0:.4f}".format(a_mean))

365
365.4
365.38
365.3774

Median¶

The median, $Me$, is the number at the middle of a dataset after a sorting from the lower to the higher value. As a consequence, to get the median value of a dataset, you have to order the data values from smallest to largest. If the number of data values is odd, then the sample median is the middle value in the ordered list; if it is even, then the sample median is the average of the two middle values (Ross, 2010).

print ('-------')
print ('median')
median = myDataset.Zr.median()
print("{0:.1f}".format(median))
print ('-------')

plt.figure()
plt.hist(myDataset.Zr, bins= "auto", density = True, label="Measurements Hist") 
plt.axvline(median, color="red", label="Median")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

-------
median
339.4
-------

Mode¶

The mode, $Mo$, of a dataset is the value that appears most frequently in the data set (Ross, 2010). In python, it is possible to retreive the modal value using the following comand:

from scipy.stats.mstats import mode
print ('modal value')
modal_value = mode(myDataset['Zr'].astype(int))[0]
print (modal_value[0])

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(modal_value[0], color="red", label="Modal value")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

modal value
253.0

Dispersion of a dataset¶

We have just introduced several estimators of the central tendency of a dataset. However we have not yet considered any measure of its variability. The range, the variance and the standard deviation are all estimators of the dispersion (i.e., variability) of a dataset.

Range¶

A first grossy estimator of the variability of a dataset is provided by the range. The range, $ R $, is the difference between the highest and lowest values in the the data set:

\begin{equation} {R= ( z_{max} - z_{min} ) } \label{eq:Range}\end{equation}

In Pandas, we can easily estimate the range as follow:

print ('-------')

print ('Range')
r = myDataset['Zr'].max()- myDataset['Zr'].min()
print("{0:.1f}".format(r))

print ('-------')

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(myDataset['Zr'].max(), color="red", label="Max value")
plt.axvline(myDataset['Zr'].min(), color="red", label="Min value")
plt.axvspan(myDataset['Zr'].min(), myDataset['Zr'].max(), alpha=0.1, color='red', label="Range")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

-------
Range
734.8
-------

Variance and Standard Deviation¶

The variance, $ \sigma^2 $, of a dataset is defined as the averaged squared deviations from the sample mean:

\begin{equation} {\sigma^2 = \frac{\sum_{i=1}^{n}(z_i - \mu)^2} {n}} \label{eq:Variance}\end{equation}

The standard deviation, $ \sigma $, is the square root of thr variance:

\begin{equation} {\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{n}(z_i - \mu)^2} {n}}} \label{eq:StandardDeviation}\end{equation}

The variance and the standard deviation of a dataset are easily derived in Pandas using the following commands:

print ('-------')
print ('Variance')
variance =  myDataset['Zr'].var()
print("{0:.1f}".format(variance))
print ('-------')
print ('Standard Deviation')
stddev =  myDataset['Zr'].std()
print("{0:.1f}".format(stddev))

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(myDataset['Zr'].mean()-stddev, color="red", label="Max value")
plt.axvline(myDataset['Zr'].mean()+stddev, color="red", label="Min value")
plt.axvspan(myDataset['Zr'].mean()-stddev, myDataset['Zr'].mean()+stddev, alpha=0.1, color='red', label="mean +/- 1-sigma")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

-------
Variance
14020.9
-------
Standard Deviation
118.4

Shape of a dataset¶

After the introduction of some parameters providing information about the central tendency and the variability of a data set, let's start analyzing an index giving some constraints about of the shape of a distribution: the skewness.

Skewness¶

The skewness is a statistical parameter providing information about the symmetry in a distribution. In the case of a symmetric distribution, the arithmetic mean, the median and the mode express the same value. As a consequence, $Mo = Me = \mu_A$. Please note the coincidence of these three values, altought being a necessary condition for symmetric distributions, does not guarantee the symmetry of a distribution. On the contrary, the non-coincidence of these three parameters point to a skewed distribution. In particular, in the cases where $Mo < Me < \mu_A$ and $\mu_A < Me < Mo$ the distribution is characterized by a tail on the right and left side respectively.

In the specific case of Zr distribution of our dataset, where $Mo < Me < \mu_A$ dysplayng a tail on the right side, as expected.

plt.figure()
plt.hist(myDataset.Zr, bins= 16, density = True, label="Measurements Hist") 
plt.axvline(modal_value[0], color="red", label="Modal Value")
plt.axvline(median, color="black",  label="Median Value")
plt.axvline(a_mean, color="green",  label="Arithmetic mean")
plt.xlabel('Zr [ppm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

A parameter providing us information about the skewness of a distribution is the Pearson’s first coefficient of skewness:

\begin{equation} \alpha _1 =\frac{3 \left ( \mu -Mo \right )}{\sigma } \label{eq:Pearson1Skew}\end{equation}

A second parameter providing us information about the skewness of a distribution is the Pearson’s second moment of skewness:

\begin{equation} \alpha _2 =\frac{3 \left ( \mu -Me \right )}{\sigma } \label{eq:Pearson2Skew}\end{equation}

An addittional parameter providing us information about the skewness of a distribution is the Fisher-Pearson's coefficient of skewness:

\begin{equation} {g_1 =\frac{\frac{1}{n}\sum_{i=1}^{n} \left ({ z_i - \mu}\right )^3} {\sigma^3} } \label{eq:FisherIndexSkew}\end{equation}

In pyton, the $\alpha _1$, $\alpha _2$, and $g_1$ parameters can be determinedas follow:

from scipy.stats import skew

print ("Pearson’s first coefficient of skewness")
mean = myDataset['Zr'].mean()
median = myDataset['Zr'].median()
standard_deviation = myDataset['Zr'].median()
a1=3*(mean-median)/standard_deviation
print("{0:.1f}".format(a1))
print ('-------')
print ('Pearson’s second moment of skewness')
modal_value = mode(myDataset['Zr'].astype(int))[0]
a2=3*(mean-modal_value[0])/standard_deviation
print("{0:.1f}".format(a2))
print ('-------')
print ('Fisher-Pearson’s coefficient of skewness')
g1 = skew(myDataset['Zr'])
print("{0:.1f}".format(g1))
print ('-------')

Pearson’s first coefficient of skewness
0.2
-------
Pearson’s second moment of skewness
1.0
-------
Fisher-Pearson’s coefficient of skewness
1.3
-------

Descriptive statistics in Pandas¶

As reported in the official documentation of Pandas, the command describe() "generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values."

myDataset[['Ba','Sr','Zr','La']].describe()

	Ba	Sr	Zr	La
count	370.000000	369.000000	370.000000	370.000000
mean	789.733259	516.422115	365.377397	74.861088
std	523.974960	241.922439	118.409962	18.256772
min	0.000000	9.541958	185.416567	45.323289
25%	297.402777	319.667551	274.660242	61.745228
50%	768.562055	490.111131	339.412064	71.642167
75%	1278.422645	728.726116	438.847368	83.670805
max	2028.221963	1056.132069	920.174406	169.550008