In addition to summary statistics, data visualization helps reveal the characteristics of the data and how different variables are related.
There are many examples of data visualization with Matplotlib, Seaborn, and Plotly. This tutorial goes through a few examples: line, scatter, bar (histogram), pair, box, violin, and joint plots.
Each plot is shown with one of the graphing packages. Matplotlib is a base-level Python plotting package, Seaborn builds on Matplotlib and automates more complex plots, and Plotly creates engaging interactive plots.
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
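Run the next cell to:
- generate n linearly spaced values between 0 and n-1 with np.linspace(start,end,count)
- draw uniform random numbers between 0 and 1 with np.random.rand(count)
- draw normally distributed random numbers with np.random.normal(mean,std,count)
- build a time series z that changes by y[i]*0.1 at each step, staying within the range -3 to 3
- combine tt, x, y, and z with a vertical stack np.vstack and transpose .T for column oriented data
- create a pandas DataFrame from tt, x, y, and z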
import numpy as np
import pandas as pd
np.random.seed(0) # change seed for different answer
n = 1000
tt = np.linspace(0,n-1,n)
x = np.random.rand(n)+tt/500
y = np.random.normal(0,x,n)
z = [0]
for i in range(1,n):
    z.append(min(max(-3,z[i-1]+y[i]*0.1),3))
data = pd.DataFrame(np.vstack((tt,x,y,z)).T,\
                    columns=['time','x','y','z'])
data['w'] = '0-499'
for i in range(int(n/2),n):
    data.at[i,'w'] = '500-999'
data.head()
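The nested min(max(-3,v),3) in the loop keeps each new value of z inside the range -3 to 3. As a minimal sketch for comparison, the same limit can be written with np.clip. Here np.clip is a substitute, not the method used in the cell above, and the helper bound and the test values are made up for illustration.

```python
import numpy as np

def bound(v, lo=-3, hi=3):
    # same nested min/max used in the loop that builds z
    return min(max(lo, v), hi)

# hypothetical test values below, inside, and above the range
values = [-5.0, -3.0, 0.25, 3.0, 4.2]
bounded = [bound(v) for v in values]
clipped = np.clip(values, -3, 3)

print(bounded)
print(clipped)
```

Both forms leave values inside the range unchanged and replace values outside the range with the nearest limit.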
A line plot is the most basic plot type. There is an introductory tutorial on plots in the Begin Python Course, Lesson 12. Visit that course module if you need additional information on basic plots such as plt.plot().
plt.plot(tt,z)
plt.show()
The line plot can also be improved with customized trend styles. Below is an example with common options.
c=Colors
=============  ===============================
character      color
=============  ===============================
``'b'``        blue
``'g'``        green
``'r'``        red
``'y'``        yellow
``'k'``        black
=============  ===============================
m=Markers
=============  ===============================
character      description
=============  ===============================
``'.'``        point marker
``'o'``        circle marker
``'s'``        square marker
``'^'``        triangle marker
``'*'``        star marker
=============  ===============================
ln=Line Styles
=============  ===============================
character      description
=============  ===============================
``'-'``        solid line style
``'--'``       dashed line style
``'-.'``       dash-dot line style
``':'``        dotted line style
=============  ===============================
plt.figure(1,figsize=(10,6)) # adjust figure size
ax=plt.subplot(2,1,1) # subplot 1
plt.plot(tt,z,'r-',linewidth=3,label='z') # plot red line
ax.grid() # add grid
plt.ylabel('z'); plt.legend() # add ylabel, legend
plt.subplot(2,1,2) # subplot 2
plt.plot(tt,x,'b.',label='x') # plot blue dots
plt.plot(tt,y,color='orange',label='y',alpha=0.7) # plot orange line
plt.xlabel('time'); plt.legend() # labels
plt.savefig('04-myFig.png',transparent=True,dpi=600) # save figure
plt.show() # show plot
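The color, marker, and line style characters from the tables can be combined into a single format string, as with 'r-' and 'b.' in the cell above. The sketch below is an illustration with made-up data (t and v are hypothetical names) that combines one character from each table and then reads the resulting properties back from the line object.

```python
import matplotlib.pyplot as plt

# hypothetical data for illustration only
t = [0, 1, 2, 3, 4]
v = [0, 1, 4, 9, 16]

fig, ax = plt.subplots()
# 'g--s' = green color, dashed line style, square marker
(line,) = ax.plot(t, v, 'g--s', label='g--s')
ax.legend()

# confirm how the format string was interpreted
print(line.get_color(), line.get_linestyle(), line.get_marker())
```

In a notebook the figure renders at the end of the cell; in a script, call plt.show() to display it.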
Create a plot that displays the data:
xt = [0,0.1,0.2,0.3,0.5,0.8,1.0]
yt = [1.0,2.1,3.5,6.5,7.2,5.9,6.3]
Scatter plots are similar to line plots, but they show individual points instead of values connected in series. Matplotlib and Plotly are used in this example. Matplotlib is fast and simple while Plotly has features for interactive plots.
# matplotlib
plt.scatter(x,y)
plt.show()
# plotly
fig = px.scatter(data,x='x',y='y',color='w',size='x',hover_data=['w'])
fig.show()
Create a scatter plot with matplotlib or plotly that displays xt paired with yt and zt:
xt = np.array([0,0.1,0.2,0.3,0.5,0.8,1.0])
yt = np.array([1.0,2.1,3.5,6.5,7.2,5.9,6.3])
zt = xt*yt
Change the shape of the points to a square for yt and a triangle for zt. Add a label to indicate which points are yt and zt.
A histogram is a bar chart that shows the count of values in each bin range. The alpha option sets the transparency between 0 (fully transparent) and 1 (opaque). A value of 0.7 works well to show both the overlying and underlying data.
bins = np.linspace(-3,3,31)
plt.hist(y,bins,label='y')
plt.hist(z,bins,alpha=0.7,label='z')
plt.legend()
plt.show()
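The counts behind the bars can be inspected without plotting. As a minimal sketch, np.histogram returns the count in each bin along with the bin edges for the same bins definition. The sample array and the seed are assumptions for illustration and are not the tutorial's y or z.

```python
import numpy as np

np.random.seed(0)                      # assumed seed, for repeatability only
sample = np.random.normal(0, 1, 1000)  # hypothetical data for illustration

bins = np.linspace(-3, 3, 31)          # 31 edges define 30 bins of width 0.2
counts, edges = np.histogram(sample, bins)

print(len(edges), len(counts))
print(counts.sum(), 'of', sample.size, 'values fall inside the bin range')
```

Values outside the range -3 to 3 are not counted, so the total count can be less than the number of samples.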
Create a bar plot that displays the distribution of xt, yt, and zt:
nt = 1000
xt = np.random.rand(nt)
yt = np.random.normal(0,1,nt)
zt = xt*yt
Use bins = np.linspace(-3,3,31) to create the histogram distribution.
A pair plot shows the correlation between variables. It has the distribution of each variable on the diagonal and scatter plots on the off-diagonal. A pair plot can also show a different color (hue) by category w. Pair plots show correlations between pairs of variables that may be related and give a good indication of which features (explanatory inputs) are useful for classification or regression.
sns.pairplot(data[['x','y','z','w']],hue='w')
plt.show()
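A numeric companion to the pair plot is the correlation matrix. The sketch below regenerates x, y, and z the same way as the data cell (assuming seed 0) and calls DataFrame.corr(), which returns Pearson correlation coefficients by default. The names demo and corr are illustrative and are not part of the original tutorial.

```python
import numpy as np
import pandas as pd

np.random.seed(0)                 # assumed seed, matching the data cell
n = 1000
tt = np.linspace(0, n-1, n)
x = np.random.rand(n) + tt/500
y = np.random.normal(0, x, n)
z = [0]
for i in range(1, n):
    # bounded running sum, same rule as the data cell
    z.append(min(max(-3, z[i-1] + y[i]*0.1), 3))

demo = pd.DataFrame({'x': x, 'y': y, 'z': z})
corr = demo.corr()                # Pearson correlation, values between -1 and 1
print(corr.round(2))
```

Each off-diagonal entry summarizes in one number the relationship that the matching scatter panel of the pair plot shows visually.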
Create a pair plot that displays the correlation between xt, yt, and zt, with the first half and second half of the random numbers categorized by Dist. Create a pandas DataFrame with:
nt = 100
xt = np.random.rand(nt)
yt = np.random.normal(0,1,nt)
zt = xt*yt
dt = pd.DataFrame(np.column_stack([xt,yt,zt]),columns=['xt','yt','zt'])
dt['Dist'] = 'First'
for i in range(int(nt/2),nt):
    dt.at[i,'Dist'] = 'Second'
A box plot shows data quartiles. In this case, we are comparing the first 500 points with the last 500 points.
sns.boxplot(x='w',y='x',data=data)
plt.show()
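The quartiles that the box plot draws can also be computed directly. This is a minimal sketch that rebuilds only the x and w columns under the same assumptions as the data cell (seed 0, 1000 points). groupby and quantile are standard pandas calls; the names demo and q are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)                 # assumed seed, matching the data cell
n = 1000
tt = np.linspace(0, n-1, n)
demo = pd.DataFrame({'x': np.random.rand(n) + tt/500,
                     'w': np.where(tt < n/2, '0-499', '500-999')})

# 25th, 50th (median), and 75th percentiles of x for each category in w
q = demo.groupby('w')['x'].quantile([0.25, 0.5, 0.75]).unstack()
print(q)
```

The row for each category corresponds to the bottom edge, center line, and top edge of that category's box.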
Create a box plot that shows the quartiles of yt by first and second sets as indicated in Dist.
A violin plot combines the box plot quartiles with the distribution.
sns.violinplot(x='w',y='x',data=data)
plt.show()
Create a violin plot that shows the quartiles and distribution of zt by first and second sets as indicated in Dist in the DataFrame dt.
A joint plot shows two variables, with the univariate and joint distributions. Try kind='reg', 'kde', and 'hex' to see different joint plot styles.
sns.jointplot(x='x',y='z',data=data,kind='kde')
plt.show()
Create a joint plot that shows the joint distribution of yt and zt in the DataFrame dt.
A sample data file loads if you do not have a TCLab connected. Otherwise, generate a file from the TCLab data with seconds (t), heater levels (Q1 and Q2), and temperatures (lab.T1 and lab.T2). Record data every second for 120 seconds and change the heater levels every 30 seconds to a random number between 0 and 80 with np.random.randint(). There is no need to change this program, only run it to collect the data over 2 minutes.
import tclab, time, csv
import numpy as np
try:
    n = 120
    with open('04-tclab.csv',mode='w',newline='') as f:
        cw = csv.writer(f)
        cw.writerow(['Time','Q1','Q2','T1','T2'])
        with tclab.TCLab() as lab:
            print('t Q1 Q2 T1 T2')
            for t in range(n):
                if t%30==0:
                    Q1 = np.random.randint(0,81)
                    Q2 = np.random.randint(0,81)
                    lab.Q1(Q1); lab.Q2(Q2)
                cw.writerow([t,Q1,Q2,lab.T1,lab.T2])
                if t%5==0:
                    print(t,Q1,Q2,lab.T1,lab.T2)
                time.sleep(1)
    data4=pd.read_csv('04-tclab.csv')
except:
    print('Connect TCLab to generate data')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_dyn_data2.txt'
    data4=pd.read_csv(url)
    data4.columns = ['Time','Q1','Q2','T1','T2']
data4.head()
Analyze Q1, Q2, T1, and T2 graphically with a time series plot and a pair plot. The time series plot should show Q1 and Q2 in the upper subplot and T1 and T2 in the lower subplot. The pair plot should be a 2x2 plot grid that shows the heater / temperature pairs as Q1/T1, Q2/T2.