5. Prepare Data

Python Data Science

Much of data science and machine learning work is getting clean data into the correct form. This may include data cleansing to remove outliers or bad values, scaling for machine learning algorithms, splitting into train and test sets, and enumeration of string (categorical) data as numbers. All of this needs to happen before regression, classification, or other model training. Fortunately, there are functions that automate much of this data preparation.
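
As a minimal sketch of enumerating string data, pandas factorize assigns an integer code to each unique string; the grade column here is hypothetical, not part of the sample data below.

In [ ]:
import pandas as pd
# hypothetical string data; factorize assigns an integer per unique value
f = pd.DataFrame({'grade':['A','B','A','C']})
f['grade_id'] = pd.factorize(f['grade'])[0]   # A=0, B=1, C=2
print(f)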


Generate Sample Data

Run the following cell to generate sample data that is corrupted with NaN (not a number) values and outliers, which are bad data points far outside of the expected trend.

In [ ]:
import numpy as np
import pandas as pd
np.random.seed(1)
n = 100
tt = np.linspace(0,n-1,n)
x = np.random.rand(n)+10+np.sqrt(tt)
y = np.random.normal(10,x*0.01,n)
x[1] = np.nan; y[2] = np.nan  # 2 NaN (not a number)
for i in range(3):            # add 3 outliers (bad data)
    ri = np.random.randint(0,n)
    x[ri] += np.random.rand()*100
data = pd.DataFrame(np.vstack((tt,x,y)).T,
                    columns=['time','x','y'])
data.head()


Visualize Data

The outliers are shown on a semi-log y plot. The NaN values do not appear on the plot; they are simply missing points.

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.semilogy(tt,x,'r.',label='x')
plt.semilogy(tt,y,'b.',label='y')
plt.legend(); plt.xlabel('time')
plt.text(50,60,'Outliers')
plt.show()


Remove Outliers and Bad Data

NaN values are removed with numpy by identifying the rows iz that contain any NaN. Next, those rows are removed with z=z[~iz], where ~ is a bitwise not operator that negates the boolean array.

In [ ]:
z = np.array([[      1,      2],
              [ np.nan,      3],
              [      4, np.nan],
              [      5,      6]])
iz = np.any(np.isnan(z), axis=1)
print(~iz)
z = z[~iz]
print(z)

The pandas DataFrame method dropna removes any row that contains a NaN value. Rows 1 and 2 are dropped.

In [ ]:
# drop any row with bad (NaN) values
data = data.dropna()
data.head()

There are several graphical techniques to help detect outliers. A box plot or a histogram shows the 3 outlying points, as demonstrated in the next two cells.

In [ ]:
plt.boxplot(data['x'])
plt.show()
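
A histogram of the same column also separates the outlying points from the bulk of the data; the bin count below is an arbitrary choice.

In [ ]:
plt.hist(data['x'], bins=30)   # outliers appear as isolated bars
plt.xlabel('x'); plt.show()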

A Grubbs test or other statistical measure can detect outliers. The Grubbs test, in particular, assumes univariate, normally distributed data and is intended to detect only a single outlier; a minimal sketch of the test is given after the next cell. In practice, many outliers can be eliminated by removing points that violate a change limit or upper / lower bounds. The statement data[data['x']<30] keeps the rows where x is less than 30.

In [ ]:
data = data[data['x']<30]
plt.boxplot(data['x'])
plt.show()
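
The Grubbs test is not used in the cell above, but a minimal sketch of a two-sided, single-outlier version can be built from scipy.stats; the significance level alpha=0.05 is an assumption.

In [ ]:
from scipy import stats
def grubbs_outlier(v, alpha=0.05):
    # test statistic: largest deviation from the mean, in standard deviations
    v = np.asarray(v, dtype=float)
    n = len(v)
    G = np.max(np.abs(v - v.mean())) / v.std(ddof=1)
    # critical value from the t-distribution
    t = stats.t.ppf(1 - alpha/(2*n), n-2)
    Gcrit = (n-1)/np.sqrt(n) * np.sqrt(t**2/(n-2+t**2))
    return G > Gcrit   # True if the most extreme point is an outlier
print(grubbs_outlier([8,9,10,9,50]))   # True: 50 is flagged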


Time Activity

Without looking at a clock, run this cell to record 1 second intervals for 10 seconds. When you run the cell, press Enter every time you think 1 second has passed. After you collect the data, use a boxplot to identify any data points in tsec that are outliers.

In [ ]:
import time
from IPython.display import clear_output
tsec = []
input('Press "Enter" to record 1 second intervals'); t = time.time()
for i in range(10):
    clear_output(); input('Press "Enter": ' + str(i+1))
    tsec.append(time.time()-t); t = time.time()
clear_output(); print('Completed. Add boxplot to identify outliers')
In [ ]:
# Add a boxplot to identify outliers
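
One possible solution is sketched below, assuming tsec was filled by the cell above.

In [ ]:
plt.boxplot(tsec)       # intervals far from 1 second appear as outliers
plt.ylabel('seconds')
plt.show()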


Scale Data

The sklearn package has a preprocessing module to implement common scaling methods. The StandardScaler is shown below, where each column is normalized to zero mean and a standard deviation of one. The common scaling methods are fit_transform(X) to fit and transform, transform(X) to transform based on a previous fit, and inverse_transform(Xs) to scale back to the original representation.

In [ ]:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
ds = s.fit_transform(data)
print(ds[0:5]) # print 5 rows

The scaled values in ds are returned as a numpy array, so they are converted back to a pandas DataFrame, reusing the column names from data.

In [ ]:
ds = pd.DataFrame(ds,columns=data.columns)
ds.head()
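
To sketch the other two methods, transform applies the previously fitted mean and standard deviation to new rows, and inverse_transform recovers the original units; the first two rows of data stand in for hypothetical new measurements.

In [ ]:
new = data.iloc[0:2]             # stand-in for new measurements
ns = s.transform(new)            # scale with the fitted mean / std
print(ns)
print(s.inverse_transform(ns))   # back to the original units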


Divide Data

Data is divided into train and test sets to separate a fraction of the rows for evaluating classification or regression models. A typical split is 80% for training and 20% for testing, although the range depends on how much data is available and the objective of the study.

In [ ]:
divide = int(len(ds)*0.8)
train = ds[0:divide]
test = ds[divide:]
print(len(train),len(test))

The function train_test_split in sklearn serves the specific purpose of splitting data into train and test sets. There are options such as shuffle=True to randomize the selection in each set and random_state to make the shuffled split reproducible.

In [ ]:
from sklearn.model_selection import train_test_split
train,test = train_test_split(ds, test_size=0.2, shuffle=True)
print(len(train),len(test))

TCLab Activity


Data with Bad Values & Outliers

Generate a new data file with some randomly inserted bad data (3 minutes of data collection) or, if no TCLab device is connected, read a data file from an online link with the following code.

In [ ]:
import tclab, time, csv
import numpy as np
import pandas as pd

try:
    with tclab.TCLab() as lab:
        with open('05-tclab.csv',mode='w',newline='') as f:
            cw = csv.writer(f)
            cw.writerow(['Time','Q1','Q2','T1','T2'])
            print('t Q1 Q2 T1    T2')
            for t in range(180):
                T1 = lab.T1; T2 = lab.T2
                # insert bad values
                bad = np.random.randint(0,30)
                T1=np.nan if bad==10 else T1
                T2=np.nan if bad==15 else T2
                # insert random number (possibly outlier)
                outlier = np.random.randint(-40,150)
                T1=outlier if bad==20 else T1
                T2=outlier if bad==25 else T2
                # change heater
                if t%30==0:
                    Q1 = np.random.randint(0,81)
                    Q2 = np.random.randint(0,81)
                    lab.Q1(Q1); lab.Q2(Q2)
                cw.writerow([t,Q1,Q2,T1,T2])
                if t%10==0:
                    print(t,Q1,Q2,T1,T2)
                time.sleep(1)
    # read the data file after the with block closes it
    data5 = pd.read_csv('05-tclab.csv')
except:
    print('Connect TCLab to generate new data')
    print('Importing data from online source')
    url = 'http://apmonitor.com/do/uploads/Main/tclab_bad_data.txt'
    data5=pd.read_csv(url)

Cleanse, Scale, and Divide Data

After generating and importing data5 above, remove any rows with NaN values or outliers in the T1 or T2 columns. Scale the data with a StandardScaler in scikit-learn. Divide the data into train (80%) and test (20%) sets.

In [ ]:
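
One possible approach is sketched below. The temperature bounds are assumptions about plausible lab readings, not values taken from the data file.

In [ ]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# cleanse: drop NaN rows, then remove implausible temperatures (assumed bounds)
data5 = data5.dropna()
for c in ['T1','T2']:
    data5 = data5[(data5[c]>10)&(data5[c]<90)]

# scale each column to zero mean and unit standard deviation
s5 = StandardScaler()
ds5 = pd.DataFrame(s5.fit_transform(data5), columns=data5.columns)

# divide into 80% train and 20% test sets
train5, test5 = train_test_split(ds5, test_size=0.2, shuffle=True)
print(len(train5), len(test5))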