Much of data science and machine learning work is getting clean data into the correct form. This may include data cleansing to remove outliers or bad information, scaling for machine learning algorithms, splitting into train and test sets, and enumeration of string data. All of this needs to happen before regression, classification, or other model training. Fortunately, there are functions that help with automating data preparation.
Run the following cell to generate sample data that is corrupted with NaN (not a number) entries and outliers, points that lie far outside the expected trend.
import numpy as np
import pandas as pd
np.random.seed(1)
n = 100
tt = np.linspace(0,n-1,n)
x = np.random.rand(n)+10+np.sqrt(tt)
y = np.random.normal(10,x*0.01,n)
x[1] = np.nan; y[2] = np.nan # 2 NaN (not a number)
for i in range(3): # add 3 outliers (bad data)
    ri = np.random.randint(0,n)
    x[ri] += np.random.rand()*100
data = pd.DataFrame(np.vstack((tt,x,y)).T,
                    columns=['time','x','y'])
data.head()
The outliers are shown on a semi-log y plot. The NaN values do not show on the plot and appear as missing points.
import matplotlib.pyplot as plt
%matplotlib inline
plt.semilogy(tt,x,'r.',label='x')
plt.semilogy(tt,y,'b.',label='y')
plt.legend(); plt.xlabel('time')
plt.text(50,60,'Outliers')
plt.show()
NaN values are removed with numpy by identifying the rows iz that contain any NaN. Next, those rows are removed with z=z[~iz], where ~ is the bitwise not operator.
z = np.array([[ 1, 2],
[ np.nan, 3],
[ 4, np.nan],
[ 5, 6]])
iz = np.any(np.isnan(z), axis=1)
print(~iz)
z = z[~iz]
print(z)
The pandas DataFrame method dropna drops rows that contain NaN values. Rows 1 and 2 are dropped.
# drop any row with bad (NaN) values
data = data.dropna()
data.head()
There are several graphical techniques to help detect outliers. A box plot or histogram reveals the 3 outlying points; both are shown below.
plt.boxplot(data['x'])
plt.show()
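For comparison, a histogram view of the same column:
plt.hist(data['x'], bins=30)
plt.xlabel('x'); plt.ylabel('count')
plt.show()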
A Grubbs test or other statistical measure can detect outliers. The Grubbs test, in particular, assumes univariate, normally distributed data and is designed to detect only a single outlier. In practice, many outliers can be eliminated by removing points that violate a rate-of-change limit or upper / lower bounds. The statement data[data['x']<30] keeps the rows where x is less than 30.
data = data[data['x']<30]
plt.boxplot(data['x'])
plt.show()
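Below is a minimal sketch of the two-sided Grubbs test mentioned above, using scipy.stats for the t-distribution critical value. The helper grubbs_test is illustrative (not a library function) and reuses np and data from the cells above; after the bound filter, it should no longer flag an outlier.
from scipy import stats

def grubbs_test(x, alpha=0.05):
    # two-sided Grubbs test for a single outlier
    x = np.asarray(x); N = len(x)
    G = np.max(np.abs(x - np.mean(x))) / np.std(x, ddof=1)
    # critical value from the t-distribution
    t2 = stats.t.ppf(1 - alpha/(2*N), N-2)**2
    Gcrit = (N-1)/np.sqrt(N) * np.sqrt(t2/(N-2+t2))
    return G, Gcrit, G > Gcrit

G, Gcrit, reject = grubbs_test(data['x'])
print(G, Gcrit, reject)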
Without looking at a clock, run this cell to record 1-second intervals for 10 seconds. When you run the cell, press Enter every time you think 1 second has passed. After you collect the data, use a box plot to identify any data points in tsec that are outliers.
import time
from IPython.display import clear_output
tsec = []
input('Press "Enter" to record 1 second intervals'); t = time.time()
for i in range(10):
    clear_output(); input('Press "Enter": ' + str(i+1))
    tsec.append(time.time()-t); t = time.time()
clear_output(); print('Completed. Add boxplot to identify outliers')
# Add a boxplot to identify outliers
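One possible completion is a box plot of the recorded intervals in tsec:
plt.boxplot(tsec)
plt.ylabel('seconds')
plt.show()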
The sklearn package has a preprocessing module to implement common scaling methods. The StandardScaler is shown below, where each column is normalized to zero mean and a standard deviation of one. The common scaler methods are fit_transform(X) to fit and transform, transform(X) to transform based on another fit, and inverse_transform(Xs) to scale back to the original representation.
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
ds = s.fit_transform(data)
print(ds[0:5]) # print 5 rows
The value ds is returned as a numpy array, so it needs to be converted back to a pandas DataFrame, re-using the column names from data.
ds = pd.DataFrame(ds,columns=data.columns)
ds.head()
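The same scaler can also map the standardized values back to the original units; a short check with inverse_transform, reusing s and ds from above:
# unscale and recover the original values
du = pd.DataFrame(s.inverse_transform(ds), columns=ds.columns)
du.head()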
Data is divided into train and test sets to separate a fraction of the rows for evaluating classification or regression models. A typical split is 80% for training and 20% for testing, although the range depends on how much data is available and the objective of the study.
divide = int(len(ds)*0.8)
train = ds[0:divide]
test = ds[divide:]
print(len(train),len(test))
The function train_test_split in sklearn serves the specific purpose of splitting data into train and test sets. There are options such as shuffle=True to randomize the selection in each set.
from sklearn.model_selection import train_test_split
train,test = train_test_split(ds, test_size=0.2, shuffle=True)
print(len(train),len(test))
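Because shuffle=True randomizes the selection, the split changes on every run; passing a fixed random_state makes the split reproducible:
# reproducible shuffled split
train,test = train_test_split(ds, test_size=0.2, shuffle=True, random_state=0)
print(len(train),len(test))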
Generate a new data file with randomly inserted bad data (requires a TCLab device and 3 minutes to collect) or read a data file from an online source with the following code.
import tclab, time, csv
import numpy as np
import pandas as pd
try:
    with tclab.TCLab() as lab:
        with open('05-tclab.csv',mode='w',newline='') as f:
            cw = csv.writer(f)
            cw.writerow(['Time','Q1','Q2','T1','T2'])
            print('t Q1 Q2 T1 T2')
            for t in range(180):
                T1 = lab.T1; T2 = lab.T2
                # insert bad (NaN) values
                bad = np.random.randint(0,30)
                T1 = np.nan if bad==10 else T1
                T2 = np.nan if bad==15 else T2
                # insert random number (possible outlier)
                outlier = np.random.randint(-40,150)
                T1 = outlier if bad==20 else T1
                T2 = outlier if bad==25 else T2
                # change heater levels every 30 seconds
                if t%30==0:
                    Q1 = np.random.randint(0,81)
                    Q2 = np.random.randint(0,81)
                    lab.Q1(Q1); lab.Q2(Q2)
                cw.writerow([t,Q1,Q2,T1,T2])
                if t%10==0:
                    print(t,Q1,Q2,T1,T2)
                time.sleep(1)
    data5 = pd.read_csv('05-tclab.csv')
except:
    print('Connect TCLab to generate new data')
    print('Importing data from online source')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_bad_data.txt'
    data5 = pd.read_csv(url)
After generating and importing data5 above, remove any rows with NaN values or outliers in the T1 or T2 columns. Scale the data with a StandardScaler in scikit-learn. Divide the data into train (80%) and test (20%) sets.
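A possible approach is sketched below; the temperature bounds of 15-100 degC are an assumption about plausible TCLab readings, and the variable names are illustrative.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# remove NaN rows, then rows with implausible temperatures (assumed bounds)
data5 = data5.dropna()
for c in ['T1','T2']:
    data5 = data5[(data5[c]>15)&(data5[c]<100)]

# scale each column to zero mean and unit standard deviation
s5 = StandardScaler()
ds5 = pd.DataFrame(s5.fit_transform(data5), columns=data5.columns)

# 80% train, 20% test
train5,test5 = train_test_split(ds5, test_size=0.2, shuffle=True)
print(len(train5),len(test5))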