We start learning Python for the analysis of geological data by performing two basic, everyday operations: importing a dataset with pandas and visualizing selected features in a binary diagram. Let's start with importing a dataset using the pandas library. But what is pandas? Pandas is a Python library (i.e., a tool) designed to help us work with structured data. In practice, it provides many ready-to-use commands for working with data. As an example, we can easily use pandas to import a dataset stored in a text file or an Excel worksheet with a single line of code, as the following two examples show:
import pandas as pd
# Example 1
myDataset1 = pd.read_csv('Smith_glass_post_NYT_data.csv')
# Example 2
myDataset2 = pd.read_excel('Smith_glass_post_NYT_data.xlsx', sheet_name='Supp_traces')
In the first example, we define a pandas DataFrame (i.e., myDataset1) by reading a comma-delimited text file. As reported in the official documentation of the pandas library, a DataFrame "is a 2-dimensional labeled data structure with columns of potentially different types". What does this mean? We can think of a DataFrame as a powerful, fully editable table:
from IPython.display import display
display(myDataset1)
The second example is similar to the first, but it reads an Excel file. Because an Excel file may contain several worksheets, the call points to a specific one: Supp_traces. The imported dataset contains trace element concentrations of volcanic products (i.e., tephras) published by Smith et al. (2011). It will be used as a representative example of a scientific dataset. In detail, it consists of major (Supp_majors) and trace element (Supp_traces) analyses of tephra samples belonging to the recent activity (last 15 ky) of the Campi Flegrei caldera.
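Before plotting, it can help to check what a freshly imported DataFrame contains. The following is a minimal sketch that builds a tiny table from an in-memory string, so that it runs without the data files; the sample names and values are invented for illustration and are not taken from Smith et al. (2011).

```python
import io
import pandas as pd

# Made-up values used only for illustration;
# they are NOT taken from Smith et al. (2011).
csv_text = """Sample,Epoch,Zr,Th
A,1,300.0,25.0
B,2,500.0,40.0
C,3b,620.0,55.0
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column labels
print(toy.dtypes)         # one data type per column
print(toy.head(2))        # first two rows of the table
```

The same four calls (shape, columns, dtypes, head) should work unchanged on myDataset1 or myDataset2.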
To start plotting, we can use the matplotlib library. As reported in the official documentation, it is "a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms." In detail, it "can generate plots, histograms, power spectra, bar charts, errorcharts, scatterplots, etc., with just a few lines of code". Let's start plotting:
import matplotlib.pyplot as plt
x = myDataset1.Zr
y = myDataset1.Th
plt.scatter(x, y)
plt.show()
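The attribute-style access used above (myDataset1.Zr) is a shorthand for the bracket notation myDataset1["Zr"]. The short sketch below, based on made-up values, shows that the two forms return the same column; the column name "La/Yb" is a hypothetical example of a label that is not a valid Python identifier and can therefore only be reached with brackets.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Zr": [300.0, 500.0],
    "Th": [25.0, 40.0],
    "La/Yb": [10.0, 12.0],  # hypothetical column name containing '/'
})

# Attribute access and bracket access return the same column
same = toy.Zr.equals(toy["Zr"])
print(same)

# A label such as "La/Yb" cannot be written as toy.La/Yb,
# so bracket notation is required here
ratio = toy["La/Yb"]
print(ratio.tolist())
```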
Now we can start adding features to the diagram, such as a title and axis labels:
plt.figure()
plt.scatter(x, y)
plt.title("My First Diagram")
plt.xlabel("Zr [ppm]")
plt.ylabel("Th [ppm]")
plt.show()
To improve our knowledge of Python for the visualization of scientific data, we now show how to filter (slice) our dataset. As an example, we can plot the analyses characterized by Zr contents greater than and less than 450 ppm in blue and red, respectively, and add a legend. Note that, as written, an analysis with Zr exactly equal to 450 ppm falls in neither subset.
# Define two sub-datasets for Zr > 450 and Zr < 450, respectively
mySubDataset1 = myDataset1[myDataset1.Zr > 450]
mySubDataset2 = myDataset1[myDataset1.Zr < 450]
# Generate a new figure
plt.figure()
# Generate the Zr vs. Th scatter diagram for Zr > 450
# in blue, also defining the legend caption as "Zr > 450 [ppm]"
x1 = mySubDataset1.Zr
y1 = mySubDataset1.Th
plt.scatter(x1, y1, color='blue', label="Zr > 450 [ppm]")
# Generate the Zr vs. Th scatter diagram for Zr < 450
# in red, also defining the legend caption as "Zr < 450 [ppm]"
x2 = mySubDataset2.Zr
y2 = mySubDataset2.Th
plt.scatter(x2, y2, color='red', label="Zr < 450 [ppm]")
plt.title("My Second Diagram")
plt.xlabel("Zr [ppm]")
plt.ylabel("Th [ppm]")
# generate the legend
plt.legend()
plt.show()
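The filtering step relies on a boolean mask: a comparison such as myDataset1.Zr > 450 returns one True or False value per row, and indexing the DataFrame with that mask keeps only the True rows. A minimal sketch with made-up values (not from the dataset) illustrates the mechanism, including how a value of exactly 450 is handled:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Zr": [300.0, 450.0, 600.0],
    "Th": [25.0, 38.0, 55.0],
})

mask_high = toy["Zr"] > 450   # boolean Series, one value per row
high = toy[mask_high]         # rows where the mask is True
low = toy[toy["Zr"] < 450]    # strictly less than 450
rest = toy[~mask_high]        # '~' negates the mask, i.e. Zr <= 450

print(mask_high.tolist())
print(len(high), len(low), len(rest))
```

Using ~mask_high (or <=) instead of a second strict comparison is one way to make sure that no row is silently left out of both subsets.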
Now, we are going to learn how to filter our dataset using the values reported in the column 'Epoch' (i.e., 1, 2, 3, and 3b), which subdivide the eruptions studied by Smith et al. (2011) into four different periods. We will start by plotting the different epochs with different colors and labels:
plt.figure()
myData1 = myDataset1[(myDataset1.Epoch.astype(str) == '1')]
plt.scatter(myData1.Zr, myData1.Th, label='Epoch 1')
myData2 = myDataset1[(myDataset1.Epoch.astype(str) == '2')]
plt.scatter(myData2.Zr, myData2.Th, label='Epoch 2')
myData3 = myDataset1[(myDataset1.Epoch.astype(str) == '3')]
plt.scatter(myData3.Zr, myData3.Th, label='Epoch 3')
myData4 = myDataset1[(myDataset1.Epoch.astype(str) == '3b')]
plt.scatter(myData4.Zr, myData4.Th, label='Epoch 3b')
plt.title("My Third Diagram")
plt.xlabel("Zr [ppm]")
plt.ylabel("Th [ppm]")
plt.legend()
plt.show()
Readers who are already familiar with the Python programming language might suggest a way to compress the code above, making it more concise:
epochs = ['1','2','3','3b']
plt.figure()
for epoch in epochs:
    myData = myDataset1[(myDataset1.Epoch.astype(str) == epoch)]
    plt.scatter(myData.Zr, myData.Th, label="Epoch " + epoch)
plt.title("My Third Diagram again")
plt.xlabel("Zr [ppm]")
plt.ylabel("Th [ppm]")
plt.legend()
plt.show()
In Python, the for loop is used to repeat a block of code; the statements that belong to the block are identified by their indentation. You should learn how to use it together with the other compound statements, which we will describe later in the book (Chapter XX). However, please note that you can successfully complete many tasks without a deep knowledge of the syntax and “core semantics” of the Python language.
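As a possible alternative to listing the epochs by hand, pandas can split a table into groups with groupby, and a for loop can then visit each group in turn. This is only a sketch on made-up values, not the approach used in the text above; inside the loop, one could call plt.scatter on each group in the same way as before.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Epoch": ["1", "2", "3b", "1"],
    "Zr": [300.0, 500.0, 620.0, 350.0],
    "Th": [25.0, 40.0, 55.0, 28.0],
})

counts = {}
# groupby yields (key, sub-table) pairs, one for each distinct Epoch value
for epoch, group in toy.groupby(toy["Epoch"].astype(str)):
    counts[epoch] = len(group)
    print("Epoch", epoch, "->", len(group), "rows")
```

The astype(str) conversion mirrors the one used above: it ensures that labels such as 3 and 3b are all compared as text.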
Finally, we will plot the different epochs in separate subplots, setting the same limits for the x and y axes in each panel:
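# plt.subplots() creates the figure and a 2 x 2 grid of axes in one call,
# so a separate plt.figure() (which would open an extra, empty figure) is not needed
f, axarr = plt.subplots(2, 2)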
axarr[0, 0].scatter(myData1.Zr, myData1.Th, label='Epoch 1')
axarr[0, 0].set_xlabel("Zr [ppm]")
axarr[0, 0].set_ylabel("Th [ppm]")
axarr[0, 0].set_xlim([100, 1000])
axarr[0, 0].set_ylim([0, 100])
axarr[0, 0].legend()
axarr[1, 0].scatter(myData2.Zr, myData2.Th, label='Epoch 2')
axarr[1, 0].set_xlabel("Zr [ppm]")
axarr[1, 0].set_ylabel("Th [ppm]")
axarr[1, 0].set_xlim([100, 1000])
axarr[1, 0].set_ylim([0, 100])
axarr[1, 0].legend()
axarr[0, 1].scatter(myData3.Zr, myData3.Th, label='Epoch 3')
axarr[0, 1].set_xlabel("Zr [ppm]")
axarr[0, 1].set_ylabel("Th [ppm]")
axarr[0, 1].set_xlim([100, 1000])
axarr[0, 1].set_ylim([0, 100])
axarr[0, 1].legend()
axarr[1, 1].scatter(myData4.Zr, myData4.Th, label='Epoch 3b')
axarr[1, 1].set_xlabel("Zr [ppm]")
axarr[1, 1].set_ylabel("Th [ppm]")
axarr[1, 1].set_xlim([100, 1000])
axarr[1, 1].set_ylim([0, 100])
axarr[1, 1].legend()
plt.tight_layout()
plt.show()
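The four nearly identical blocks above can also be written as a loop over the panels. The following sketch uses made-up values and the sharex and sharey options of plt.subplots, so that the axis limits need to be set only once. Note that axarr.flat visits the panels row by row, so the arrangement of the epochs differs slightly from the one above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only (not from Smith et al., 2011)
toy = pd.DataFrame({
    "Epoch": ["1", "2", "3", "3b"],
    "Zr": [300.0, 500.0, 620.0, 410.0],
    "Th": [25.0, 40.0, 55.0, 33.0],
})
epochs = ["1", "2", "3", "3b"]

# sharex/sharey make all panels use the same axis ranges
fig, axarr = plt.subplots(2, 2, sharex=True, sharey=True)

# Pair each panel with one epoch; axarr.flat goes row by row
for ax, epoch in zip(axarr.flat, epochs):
    subset = toy[toy["Epoch"].astype(str) == epoch]
    ax.scatter(subset["Zr"], subset["Th"], label="Epoch " + epoch)
    ax.set_xlabel("Zr [ppm]")
    ax.set_ylabel("Th [ppm]")
    ax.legend()

# With shared axes, setting the limits on one panel applies to all
axarr[0, 0].set_xlim(100, 1000)
axarr[0, 0].set_ylim(0, 100)

fig.tight_layout()
plt.show()
```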
More examples and details are available in the official documentation of the matplotlib library.