4. Visualize Data

In addition to summary statistics, data visualization helps to understand the data characteristics and how different variables are related.

There are many examples of data visualization with Matplotlib, Seaborn, and Plotly. In this tutorial, we go through a few examples for showing:

  • time series: line
  • correlated variables: scatter, pair plot
  • data distributions: bar, box, violin, distribution, joint plot

Each plot is shown with one of the graphing packages. Matplotlib is a base-level Python plotting package, Seaborn builds on Matplotlib and automates more complex plots, and Plotly creates interactive plots.

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px

Generate Data

Run the next cell to:

  • Generate n linearly spaced values between 0 and n-1 with np.linspace(start,end,count)
  • Select random samples from a uniform distribution between 0 and 1 with np.random.rand(count)
  • Select random samples from a normal (Gaussian) distribution with np.random.normal(mean,std,count)
  • Create a time series z that changes by y[i]*0.1 at each step while staying within the range -3 to 3
  • Combine tt, x, y, and z with a vertical stack np.vstack and transpose .T for column-oriented data
  • Create a pandas DataFrame with columns time, x, y, and z, then add a category column w that labels the first and second halves of the data
In [ ]:
import numpy as np
import pandas as pd
np.random.seed(0) # change seed for different answer
n = 1000
tt = np.linspace(0,n-1,n)
x = np.random.rand(n)+tt/500
y = np.random.normal(0,x,n)
z = [0]
for i in range(1,n):
    z.append(min(max(-3,z[i-1]+y[i]*0.1),3))
data = pd.DataFrame(np.vstack((tt,x,y,z)).T,\
                    columns=['time','x','y','z'])
data['w'] = '0-499'
for i in range(int(n/2),n):
    data.at[i,'w'] = '500-999'
data.head()
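
The clamping step and the row-by-row labeling above can also be written with NumPy helpers. The following is a minimal sketch, assuming the same seed and n as the cell above: np.clip stands in for min(max(...)) and np.where assigns the w labels in one step. It builds the DataFrame from a dictionary rather than np.vstack, which is a substitution made here for brevity.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
tt = np.linspace(0, n-1, n)
x = np.random.rand(n) + tt/500
y = np.random.normal(0, x, n)

# same recursion as above, with np.clip in place of min(max(...))
z = [0.0]
for i in range(1, n):
    z.append(float(np.clip(z[i-1] + y[i]*0.1, -3, 3)))

# build the DataFrame from a dictionary (alternative to np.vstack and .T)
data = pd.DataFrame({'time': tt, 'x': x, 'y': y, 'z': z})

# label the two halves in one step instead of looping over rows
data['w'] = np.where(data['time'] < n/2, '0-499', '500-999')
print(data['w'].value_counts())
```

The recursion for z stays as a loop because each value depends on the previous clamped value.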

Plot

A line plot is the most basic type. There is an introductory tutorial on plots in the Begin Python Course, Lesson 12. Visit that course module if you need additional information on basic plots such as plt.plot().

In [ ]:
plt.plot(tt,z)
plt.show()

The line plot can also be improved with customized trend styles. Below is an example with common options.

c=Colors

=============    ===============================
character        color
=============    ===============================
``'b'``          blue
``'g'``          green
``'r'``          red
``'y'``          yellow
``'k'``          black
=============    ===============================

m=Markers

=============    ===============================
character        description
=============    ===============================
``'.'``          point marker
``'o'``          circle marker
``'s'``          square marker
``'^'``          triangle marker
``'*'``          star marker
=============    ===============================

ln=Line Styles

=============    ===============================
character        description
=============    ===============================
``'-'``          solid line style
``'--'``         dashed line style
``'-.'``         dash-dot line style
``':'``          dotted line style
=============    ===============================
In [ ]:
plt.figure(1,figsize=(10,6))                         # adjust figure size
ax=plt.subplot(2,1,1)                                # subplot 1
plt.plot(tt,z,'r-',linewidth=3,label='z')            # plot red line
ax.grid()                                            # add grid
plt.ylabel('z'); plt.legend()                        # add ylabel, legend
plt.subplot(2,1,2)                                   # subplot 2
plt.plot(tt,x,'b.',label='x')                        # plot blue dots
plt.plot(tt,y,color='orange',label='y',alpha=0.7)    # plot orange line
plt.xlabel('time'); plt.legend()                      # labels
plt.savefig('04-myFig.png',transparent=True,dpi=600) # save figure
plt.show()                                           # show plot
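
The color, marker, and line style characters from the tables above can be combined into a single format string, as in 'r-' and 'b.' in the previous cell. The sketch below shows two more combinations on sine and cosine curves, which are stand-in data chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2*np.pi, 20)   # stand-in data for illustration

fig, ax = plt.subplots(figsize=(8, 4))
# 'g--s' = green, dashed line, square markers
line1, = ax.plot(t, np.sin(t), 'g--s', label='sin(t)')
# 'k:^' = black, dotted line, triangle markers
line2, = ax.plot(t, np.cos(t), 'k:^', label='cos(t)')
ax.set_xlabel('t'); ax.legend(); ax.grid()
plt.show()
```

A format string with only a marker, such as 'b.', draws points without a connecting line.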

Plot Activity

Create a plot that displays the data:

xt = [0,0.1,0.2,0.3,0.5,0.8,1.0]
yt = [1.0,2.1,3.5,6.5,7.2,5.9,6.3]

Scatter Plot

Scatter plots are similar to regular plots, but they show individual points instead of values connected in series. Matplotlib and Plotly are used in this example. Matplotlib is fast and simple, while Plotly has features for interactive plots.

In [ ]:
# matplotlib
plt.scatter(x,y)
plt.show()
In [ ]:
# plotly
fig = px.scatter(data,x='x',y='y',color='w',size='x',hover_data=['w'])
fig.show()
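
A third variable can also be shown in a Matplotlib scatter plot by coloring the points. This is a minimal sketch that regenerates x and y with the same seed as the Generate Data cell and colors each point by its time index. Because y is drawn with a standard deviation equal to x, the vertical spread should widen toward the right of the plot.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
n = 1000
tt = np.linspace(0, n-1, n)
x = np.random.rand(n) + tt/500
y = np.random.normal(0, x, n)

# color each point by its time index; later points have larger x
sc = plt.scatter(x, y, c=tt, cmap='viridis', s=10, alpha=0.7)
plt.colorbar(sc, label='time')
plt.xlabel('x'); plt.ylabel('y')
plt.show()
```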

Scatter Plot Activity

Create a scatter plot with matplotlib or plotly that displays xt paired with yt and zt:

xt = np.array([0,0.1,0.2,0.3,0.5,0.8,1.0])
yt = np.array([1.0,2.1,3.5,6.5,7.2,5.9,6.3])
zt = xt*yt

Change the shape of the points to a square for yt and a triangle for zt. Add a label to indicate which points are yt and zt.

In [ ]:
 

Bar Chart

A bar chart of binned data is a histogram: each bar shows the count of values that fall within a bin range. The alpha option sets the transparency between 0 (transparent) and 1 (opaque). A value of 0.7 works well for showing both the overlying and underlying data.

In [ ]:
bins = np.linspace(-3,3,31)
plt.hist(y,bins,label='y')
plt.hist(z,bins,alpha=0.7,label='z')
plt.legend()
plt.show()
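
The bar heights in a histogram are counts per bin, and np.histogram returns those counts without drawing anything. The sketch below uses stand-in normal samples (not the tutorial's y) to show how the 31 bin edges define 30 bins.

```python
import numpy as np

np.random.seed(0)
y = np.random.normal(0, 1, 1000)   # stand-in data, not the tutorial's y

bins = np.linspace(-3, 3, 31)      # 31 edges define 30 bins of width 0.2
counts, edges = np.histogram(y, bins)

print(len(edges), len(counts))     # number of edges and number of bins
print(counts.sum())                # samples outside [-3, 3] are not counted
```

Values that fall outside the first and last edges are dropped, so the counts may sum to less than the number of samples.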

Bar Plot Activity

Create a bar plot that displays the distribution of xt, yt, and zt:

nt = 1000
xt = np.random.rand(nt)
yt = np.random.normal(0,1,nt)
zt = xt*yt

Use bins = np.linspace(-3,3,31) to create the histogram distribution.

In [ ]:
 

Pair Plot

A pair plot shows the relationships between pairs of variables. It has the distribution of each variable on the diagonal and scatter plots on the off-diagonal. Setting hue colors the points by a category, w in this case. Pair plots reveal correlations between variables that may be related and give a good indication of which features (explanatory inputs) may be useful for classification or regression.

In [ ]:
sns.pairplot(data[['x','y','z','w']],hue='w')
plt.show()
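
The visual impression from a pair plot can be checked against numbers with a correlation matrix. This is a minimal sketch that regenerates x, y, and z with the same seed as the Generate Data cell and uses the pandas corr() method, which computes Pearson correlation coefficients by default.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
tt = np.linspace(0, n-1, n)
x = np.random.rand(n) + tt/500
y = np.random.normal(0, x, n)
z = [0]
for i in range(1, n):
    z.append(min(max(-3, z[i-1] + y[i]*0.1), 3))
data = pd.DataFrame({'x': x, 'y': y, 'z': z})

# pairwise correlation coefficients between -1 and 1
corr = data.corr()
print(corr.round(2))
```

A coefficient near zero only indicates that there is no linear relationship, so the scatter plots are still worth inspecting.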

Pair Plot Activity

Create a pair plot that displays the correlation between xt, yt, and zt, with the first half and second half of the random numbers categorized by Dist. Create a pandas DataFrame with:

nt = 100
xt = np.random.rand(nt)
yt = np.random.normal(0,1,nt)
zt = xt*yt
dt = pd.DataFrame(np.column_stack([xt,yt,zt]),columns=['xt','yt','zt'])
dt['Dist'] = 'First'
for i in range(int(nt/2),nt):
    dt.at[i,'Dist'] = 'Second'
In [ ]:
 

Box Plot

A box plot shows data quartiles. In this case, we are comparing the first 500 points with the last 500 points.

In [ ]:
sns.boxplot(x='w',y='x',data=data)
plt.show()
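
The quartiles drawn by the box plot can also be computed directly. The sketch below regenerates x with the same seed as the Generate Data cell and uses groupby with quantile to tabulate the 25th, 50th, and 75th percentiles for each w group. The variable name q is introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 1000
tt = np.linspace(0, n-1, n)
x = np.random.rand(n) + tt/500
data = pd.DataFrame({'time': tt, 'x': x})
data['w'] = np.where(data['time'] < n/2, '0-499', '500-999')

# 25th, 50th, and 75th percentiles of x for each group
q = data.groupby('w')['x'].quantile([0.25, 0.5, 0.75]).unstack()
print(q.round(2))
```

The 50th percentile is the median, which the box plot draws as the line inside each box.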

Box Plot Activity

Create a box plot that shows the quartiles of yt by first and second sets as indicated in Dist.

In [ ]:
 

Violin Plot

A violin plot combines the box plot quartiles with the distribution.

In [ ]:
sns.violinplot(x='w',y='x',data=data)
plt.show()

Violin Plot Activity

Create a violin plot that shows the quartiles and distribution of zt by first and second sets as indicated in Dist in the DataFrame dt.

In [ ]:
 

Joint Plot

A joint plot shows two variables, with the univariate and joint distributions. Try kind='reg', 'kde', and 'hex' to see different joint plot styles.

In [ ]:
sns.jointplot(x='x',y='z',data=data,kind="kde")
plt.show()

Joint Plot Activity

Create a joint plot that shows the joint distribution of yt and zt in the DataFrame dt.

In [ ]:
 

TCLab Activity

Generate or Retrieve Data

A sample data file loads if you do not have a TCLab connected. Otherwise, generate a file from the TCLab data with seconds (t), heater levels (Q1 and Q2), and temperatures (lab.T1 and lab.T2). Record data every second for 120 seconds and change the heater levels every 30 seconds to a random number between 0 and 80 with np.random.randint(). There is no need to change this program, only run it to collect the data over 2 minutes.

In [ ]:
import time, csv
import numpy as np
import pandas as pd
try:
    import tclab  # inside try so the sample file loads if tclab is missing
    n = 120
    with open('04-tclab.csv',mode='w',newline='') as f:
        cw = csv.writer(f)
        cw.writerow(['Time','Q1','Q2','T1','T2'])
        with tclab.TCLab() as lab:
            print('t Q1 Q2 T1    T2')
            for t in range(n):
                if t%30==0:
                    # new random heater levels (0 to 80) every 30 sec
                    Q1 = np.random.randint(0,81)
                    Q2 = np.random.randint(0,81)
                    lab.Q1(Q1); lab.Q2(Q2)
                cw.writerow([t,Q1,Q2,lab.T1,lab.T2])
                if t%5==0:
                    print(t,Q1,Q2,lab.T1,lab.T2)
                time.sleep(1)
    data4=pd.read_csv('04-tclab.csv')
except Exception:
    # no TCLab available: load the sample data file instead
    print('Connect TCLab to generate data')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_dyn_data2.txt'
    data4=pd.read_csv(url)
    data4.columns = ['Time','Q1','Q2','T1','T2']

data4.head()
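
The heater schedule in the program above can be previewed without hardware. This is a minimal sketch of the step logic only: it holds a level for 30 seconds and then draws a new one. It does not communicate with a TCLab, and the array Q1 and variable level are stand-ins for illustration.

```python
import numpy as np

np.random.seed(0)
n = 120
Q1 = np.zeros(n, dtype=int)
level = 0
for t in range(n):
    if t % 30 == 0:
        # upper bound is exclusive, so 81 allows values 0 through 80
        level = np.random.randint(0, 81)
    Q1[t] = level

print(Q1[::30])   # one heater level per 30-second segment
```

Over 120 seconds this gives four constant segments, so the temperature response has four step changes to analyze.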

Graphical Analysis

Analyze Q1, Q2, T1, and T2 graphically with a time series plot and a pair plot. The time series plot should show Q1 and Q2 in the upper subplot and T1 and T2 in the lower subplot. The pair plot should be a 2x2 plot grid that shows the heater / temperature pairs as Q1/T1, Q2/T2.

In [ ]: