Once data is read into Python, a first step is to analyze the data with summary statistics. This is especially true if the data set is large. Summary statistics include the count, mean, standard deviation, maximum, minimum, and quartile information for the data columns.
Run the next cell to:

- generate n linearly spaced values between 0 and n-1 with np.linspace(start,end,count)
- generate n uniform random values with np.random.rand(count)
- generate n normally distributed values with np.random.normal(mean,std,count)
- combine time, x, and y with a vertical stack np.vstack and transpose .T for column-oriented data
- save the result to 03-data.csv with header time,x,y

import numpy as np
np.random.seed(0)
n = 1000
time = np.linspace(0,n-1,n)
x = np.random.rand(n)
y = np.random.normal(1,1,n)
data = np.vstack((time,x,y)).T
np.savetxt('03-data.csv',data,header='time,x,y',delimiter=',',comments='')
The histogram is a preview of how to create graphics so that data can be evaluated visually. 04. Visualize shows how to create plots to analyze data.
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(x,10,label='x')
plt.hist(y,60,label='y',alpha=0.7)
plt.ylabel('Count'); plt.legend()
plt.show()
numpy

The np.loadtxt function reads the CSV data file 03-data.csv. Numpy calculates size (dimensions), mean (average), std (standard deviation), and median as summary statistics. If you don't specify the axis, numpy computes the statistic over all values. Use axis=0 for a statistic of each column (across the rows) or axis=1 for a statistic of each row (across the columns).
import numpy as np
data = np.loadtxt('03-data.csv',delimiter=',',skiprows=1)
print('Dimension (rows,columns):')
print(np.size(data,0),np.size(data,1))
print('Average:')
print(np.mean(data,axis=0))
print('Standard Deviation:')
print(np.std(data,0))
print('Median:')
print(np.median(data,0))
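A small illustration of the axis argument described above, using a 2x3 array (the array values here are only for demonstration):

```python
# Illustration of the axis argument for numpy summary statistics
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(np.mean(A))          # all values -> 3.5
print(np.mean(A, axis=0))  # column means -> [2.5 3.5 4.5]
print(np.mean(A, axis=1))  # row means -> [2. 5.]
```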
Calculate the skew of x*y with the scipy.stats skew function.
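A minimal sketch of this calculation, regenerating x and y with the same seed and call order as the data-generation cell above so the values match:

```python
# Sketch: skew of x*y with scipy.stats.skew
# x and y are regenerated with the same seed as the earlier cell
import numpy as np
from scipy.stats import skew

np.random.seed(0)
n = 1000
x = np.random.rand(n)          # uniform random values
y = np.random.normal(1, 1, n)  # normally distributed values
print('skew of x*y:', skew(x*y))
```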
pandas

Pandas simplifies the data analysis with the .describe() method of a DataFrame that is created with pd.read_csv(). Note that the data file can either be a local file name or a web address such as:

url='https://apmonitor.com/pdc/uploads/Main/tclab_data2.txt'
data = pd.read_csv(url)
data.describe()
import pandas as pd
data = pd.read_csv('03-data.csv')
data.describe()
Generate a file from the TCLab data with seconds (t), heater levels (Q1 and Q2), and temperatures (lab.T1 and lab.T2). Record data every second for 120 seconds and change the heater levels every 30 seconds to a random number between 0 and 80 with np.random.randint(). There is no need to change this program; only run it for 2 minutes to collect the data. If you do not have a TCLab device, the data file is read from an online link.
import tclab, time, csv
import pandas as pd
import numpy as np
try:
    # connect to TCLab if available
    n = 120
    with open('03-tclab1.csv', mode='w', newline='') as f:
        cw = csv.writer(f)
        cw.writerow(['Time','Q1','Q2','T1','T2'])
        with tclab.TCLab() as lab:
            print('t Q1 Q2 T1 T2')
            for t in range(n):
                # change heater levels every 30 seconds
                if t % 30 == 0:
                    Q1 = np.random.randint(0, 81)
                    Q2 = np.random.randint(0, 81)
                    lab.Q1(Q1); lab.Q2(Q2)
                cw.writerow([t, Q1, Q2, lab.T1, lab.T2])
                if t % 5 == 0:
                    print(t, Q1, Q2, lab.T1, lab.T2)
                time.sleep(1)
    file = '03-tclab1.csv'
    data1 = pd.read_csv(file)
except:
    print('No TCLab device found, reading online file')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_dyn_data2.txt'
    data1 = pd.read_csv(url)
Use requests
to download a sample TCLab data file for the analysis. It is saved as 03-tclab2.csv
.
import requests
import os
url = 'http://apmonitor.com/pdc/uploads/Main/tclab_data2.txt'
r = requests.get(url)
with open('03-tclab2.csv', 'wb') as f:
f.write(r.content)
print('File 03-tclab2.csv retrieved to current working directory: ')
print(os.getcwd())
Read the files 03-tclab1.csv and 03-tclab2.csv and display summary statistics for each with data.describe(). Use the summary statistics to compare the number of samples and the differences in average and standard deviation for T1 and T2.
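One way to sketch this comparison, assuming the two CSV files were created by the cells above; the summarize() helper name is my own, and the demo DataFrame at the end only exists so the sketch runs without the TCLab files:

```python
# Sketch: compare sample counts and T1/T2 statistics between two runs
import pandas as pd

def summarize(data):
    """Return count, mean, and std of T1 and T2 as a small DataFrame."""
    return data[['T1', 'T2']].describe().loc[['count', 'mean', 'std']]

# with the files from the cells above:
#   data1 = pd.read_csv('03-tclab1.csv')
#   data2 = pd.read_csv('03-tclab2.csv')
#   print(summarize(data1)); print(summarize(data2))

# small inline example (hypothetical values) so the sketch is runnable
demo = pd.DataFrame({'T1': [20.9, 21.3, 25.0], 'T2': [20.8, 21.0, 22.5]})
print(summarize(demo))
```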