3. Analyze Data

Python Data Science

Once data is read into Python, a first step is to analyze the data with summary statistics. This is especially important when the data set is large. Summary statistics include the count, mean, standard deviation, maximum, minimum, and quartile information for the data columns.


Generate Data

Run the next cell to:

  • Generate n linearly spaced values between 0 and n-1 with np.linspace(start,end,count)
  • Draw random samples from a uniform distribution between 0 and 1 with np.random.rand(count)
  • Draw random samples from a normal (Gaussian) distribution with np.random.normal(mean,std,count)
  • Combine time, x, and y with a vertical stack np.vstack and transpose with .T for column-oriented data.
  • Save the CSV text file 03-data.csv with the header time,x,y.
In [ ]:
import numpy as np
np.random.seed(0)
n = 1000
time = np.linspace(0,n-1,n)
x = np.random.rand(n)
y = np.random.normal(1,1,n)
data = np.vstack((time,x,y)).T
np.savetxt('03-data.csv',data,header='time,x,y',delimiter=',',comments='')
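
To see what the stacking and transpose steps produce, a minimal sketch with a small array (n = 5 and the second column t**2 are chosen here only for illustration) prints the intermediate shapes:

```python
import numpy as np

n = 5  # small illustrative size (the cell above uses n = 1000)
t = np.linspace(0, n - 1, n)     # n evenly spaced values from 0 to n-1
stacked = np.vstack((t, t**2))   # each input array becomes a row: shape (2, 5)
columns = stacked.T              # transpose so each array is a column: shape (5, 2)

print(t)
print(stacked.shape, columns.shape)
```

The transposed form matches the CSV layout, where each row is one sample and each column is one variable.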


Display Data Distributions

The histogram is a preview of how to create graphics so that data can be evaluated visually. 04. Visualize shows how to create plots to analyze data.

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(x,10,label='x')
plt.hist(y,60,label='y',alpha=0.7)
plt.ylabel('Count'); plt.legend()
plt.show()
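
The bin counts behind a histogram can also be inspected without a plot. As a rough sketch, np.histogram returns the counts and bin edges for the same kind of uniform samples (regenerated here so the example is self-contained):

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(1000)                   # uniform samples, as in the Generate Data cell
counts, edges = np.histogram(x, bins=10)   # 10 equal-width bins between min(x) and max(x)

print(counts)  # number of samples that fall in each bin
print(edges)   # 11 bin edges that bound the 10 bins
```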


Data Analysis with numpy

The np.loadtxt function reads the CSV data file 03-data.csv. Numpy calculates size (dimensions), mean (average), std (standard deviation), and median as summary statistics. If the axis is not specified, numpy returns a single statistic over the entire array. With axis=0 the statistic is taken down the rows (one value per column) and with axis=1 it is taken across the columns (one value per row).

In [ ]:
import numpy as np
data = np.loadtxt('03-data.csv',delimiter=',',skiprows=1)

print('Dimension (rows,columns):')
print(np.size(data,0),np.size(data,1))

print('Average:')
print(np.mean(data,axis=0))

print('Standard Deviation:')
print(np.std(data,0))

print('Median:')
print(np.median(data,0))
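
The effect of the axis argument is easier to see on a small array. This is a minimal sketch with made-up values (3 rows by 2 columns), not the data file:

```python
import numpy as np

# small illustrative array: 3 rows (samples) by 2 columns (variables)
a = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

print(np.mean(a))          # no axis: one value over all 6 entries
print(np.mean(a, axis=0))  # down the rows: one mean per column
print(np.mean(a, axis=1))  # across the columns: one mean per row
```

For column-oriented data such as 03-data.csv, axis=0 is usually the one that gives a statistic for each variable.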


Analyze data

  1. Calculate the mean, standard deviation, and median of x*y
  2. Calculate the skew of x*y with the scipy.stats skew function.
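
One possible approach is sketched below. It assumes x and y are regenerated with the same seed as the Generate Data cell rather than read back from 03-data.csv, and the name z for the product is only illustrative:

```python
import numpy as np
from scipy.stats import skew

# regenerate x and y as in the Generate Data cell so the sketch is self-contained
np.random.seed(0)
n = 1000
x = np.random.rand(n)
y = np.random.normal(1, 1, n)

z = x * y  # element-wise product of the two columns
print('Mean:', np.mean(z))
print('Standard Deviation:', np.std(z))
print('Median:', np.median(z))
print('Skew:', skew(z))
```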
In [ ]:
 


Data Analysis with pandas

Pandas simplifies data analysis with .describe(), a method of the DataFrame that pd.read_csv() returns. The data file can be either a local file name or a web address such as

url='https://apmonitor.com/pdc/uploads/Main/tclab_data2.txt'
data = pd.read_csv(url)
data.describe()
In [ ]:
import pandas as pd
data = pd.read_csv('03-data.csv')
data.describe()
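
The result of .describe() is itself a DataFrame, so individual statistics can be pulled out by row label. The sketch below uses a small stand-in DataFrame (the name df and the random values are assumptions for illustration) instead of the CSV file:

```python
import numpy as np
import pandas as pd

# small stand-in DataFrame; column names chosen to match 03-data.csv
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.random(100), 'y': rng.normal(1, 1, 100)})

summary = df.describe()          # rows are statistics, columns are variables
print(summary.loc['mean'])       # mean of every column
print(summary.loc['50%', 'y'])   # median of y (the 50% quartile)
print(df['y'].median())          # same value from the Series method
```

Note that pandas reports the sample standard deviation (ddof=1) while np.std defaults to the population form (ddof=0), so the two values can differ slightly for the same column.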


TCLab Activity


Generate Data Set 1

Generate a file from the TCLab data with seconds (t), heater levels (Q1 and Q2), and temperatures (lab.T1 and lab.T2). Record data every second for 120 seconds and change the heater levels every 30 seconds to a random integer between 0 and 80 with np.random.randint(). There is no need to change this program, only run it for 2 minutes to collect the data. If a TCLab device is not available, the program reads data set 1 from an online file instead.

In [ ]:
import tclab, time, csv
import pandas as pd
import numpy as np
try:
    # connect to TCLab if available
    n = 120 
    with open('03-tclab1.csv',mode='w',newline='') as f:
        cw = csv.writer(f)
        cw.writerow(['Time','Q1','Q2','T1','T2'])
        with tclab.TCLab() as lab:
            print('t Q1 Q2 T1    T2')
            for t in range(n):
                if t%30==0:
                    Q1 = np.random.randint(0,81)
                    Q2 = np.random.randint(0,81)
                    lab.Q1(Q1); lab.Q2(Q2)
                cw.writerow([t,Q1,Q2,lab.T1,lab.T2])
                if t%5==0:
                    print(t,Q1,Q2,lab.T1,lab.T2)
                time.sleep(1)
    file = '03-tclab1.csv'
    data1=pd.read_csv(file)
except Exception:
    # any connection or file error falls back to the online file
    print('No TCLab device found, reading online file')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_dyn_data2.txt'
    data1=pd.read_csv(url)

Read Data Set 2

Use requests to download a sample TCLab data file for the analysis. It is saved as 03-tclab2.csv.

In [ ]:
import requests
import os
url = 'http://apmonitor.com/pdc/uploads/Main/tclab_data2.txt'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
with open('03-tclab2.csv', 'wb') as f:
    f.write(r.content)
    
print('File 03-tclab2.csv retrieved to current working directory: ')
print(os.getcwd())

Data Analysis

Read the files 03-tclab1.csv and 03-tclab2.csv and display summary statistics for each with data.describe(). Use the summary statistics to compare the number of samples and the differences in the average and standard deviation of T1 and T2.
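
One way to line up the two summaries is sketched below. The helper compare_stats is hypothetical, the small DataFrames are stand-in values for illustration only, and the column names T1 and T2 are assumed to appear in both files:

```python
import pandas as pd

def compare_stats(d1, d2, cols=('T1', 'T2')):
    """Hypothetical helper: side-by-side count, mean, and std for two data sets."""
    rows = ['count', 'mean', 'std']
    s1 = d1.describe().loc[rows, list(cols)]
    s2 = d2.describe().loc[rows, list(cols)]
    # join along the columns with a top-level label for each data set
    return pd.concat([s1, s2], axis=1, keys=['set1', 'set2'])

# stand-in values for illustration only; replace with
# pd.read_csv('03-tclab1.csv') and pd.read_csv('03-tclab2.csv')
d1 = pd.DataFrame({'T1': [20.0, 22.0, 24.0], 'T2': [20.0, 21.0, 22.0]})
d2 = pd.DataFrame({'T1': [30.0, 34.0], 'T2': [25.0, 27.0]})
print(compare_stats(d1, d2))
```

The count row shows the number of samples in each file, and the mean and std rows can be compared directly across the set1 and set2 columns.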

In [ ]: