Data Cleansing

Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data.

Data cleansing is the process of removing bad data that may include outliers, missing entries, failed sensors, or other types of missing or corrupted information.

Bad data can be detected with summary statistics and data visualization. An effective way to handle it is with conditional filters that remove outliers, replace bad values such as NaN (Not a Number), or remove each data row that contains a NaN value.
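
As a minimal sketch of detection with summary statistics, the snippet below counts missing entries and prints per-column statistics. It assumes the data is held in a pandas DataFrame and reuses the small x/y table from the examples in this tutorial; the variable names n_missing and summary are illustrative only.

```python
import numpy as np
import pandas as pd

# Small example table with two missing entries
z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Count the missing values in each column
n_missing = z.isna().sum()
print(n_missing)

# Summary statistics; the 'count' row reports only the non-NaN entries
summary = z.describe()
print(summary)
```

A count that is lower than the number of rows, or a min or max far from the other quantiles, is a hint that a column deserves a closer look.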

Remove Bad Data with Numpy

NaN values are removed with numpy by first identifying the rows iz that contain NaN. Next, the rows are removed with z=z[~iz], where ~ is the bitwise not operator that inverts a Boolean array element by element.

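import numpy as np
# 4x2 array with NaN in the second and third rows
z = np.array([[ 1, 2],
              [ np.nan, 3],
              [ 4, np.nan],
              [ 5, 6]])
# iz is True for each row that contains at least one NaN
iz = np.any(np.isnan(z), axis=1)
print(~iz)
# keep only the rows without NaN
z = z[~iz]
print(z)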

The result of the selection iz is True if NaN is found anywhere in the row or False if NaN is not found in that row. The ~ is the not operator that reverses True and False.

   print(~iz)

   > [ True False False  True]

Only the rows with True are kept in the final z result.

   print(z)

   > [[1. 2.]
   >  [5. 6.]]
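
To make the row selection easier to follow, the sketch below splits it into its intermediate steps. The name mask is an illustrative assumption and is not part of the original example.

```python
import numpy as np

z = np.array([[1, 2],
              [np.nan, 3],
              [4, np.nan],
              [5, 6]])

# Element-wise test: Boolean array with the same shape as z
mask = np.isnan(z)
print(mask)

# Reduce along axis=1: True for each row with at least one NaN
iz = np.any(mask, axis=1)
print(iz)

# Number of rows that z[~iz] would remove
print(np.count_nonzero(iz))
```

Checking the count before filtering is a quick way to see how much data the cleansing step discards.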

Remove Bad Data with Pandas

Pandas manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.

import numpy as np
import pandas as pd
z = pd.DataFrame({'x':[1,np.nan,4,5],'y':[2,3,np.nan,6]})
print(z)

This produces the same values as shown with the Numpy example.

        x    y
   0  1.0  2.0
   1  NaN  3.0
   2  4.0  NaN
   3  5.0  6.0

There are two common ways to deal with NaN values: drop the rows or fill in values. The first is to remove the rows with dropna, which returns a new DataFrame and leaves z unchanged. To modify z directly and avoid the extra assignment, use z.dropna(inplace=True).

result = z.dropna()

        x    y
   0  1.0  2.0
   3  5.0  6.0
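
The following sketch shows two dropna variations under the same example data: the subset argument limits the NaN check to selected columns, and inplace=True modifies the DataFrame directly. The names only_x and zc are illustrative, and zc is a copy so that z stays available for comparison.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Only drop rows where column x is NaN; the NaN in column y is kept
only_x = z.dropna(subset=['x'])
print(only_x)

# Modify a copy in place; dropna returns None when inplace=True
zc = z.copy()
zc.dropna(inplace=True)
print(zc)
```

Note that the original index labels are retained after rows are dropped.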

Replace Bad Data with Pandas

If data is very limited, it may be better to keep the row and fill in values, either with interpolate for time-series data or with fillna to replace NaN with a default value.

result = z.fillna(z.mean())

A common replacement is the mean value for each column with z.mean(), which skips NaN values by default. In this case, the mean (average) of column x is 3.333 and column y is 3.667.

             x         y
   0  1.000000  2.000000
   1  3.333333  3.000000
   2  4.000000  3.666667
   3  5.000000  6.000000
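
For comparison, here is a minimal sketch of the interpolate option on the same table. It assumes the rows are equally spaced in time, because the default linear method fills each gap from its neighbors without using the index values.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Linear interpolation between the neighboring valid values in each column
result = z.interpolate()
print(result)
```

The NaN in column x is replaced by 2.5 (halfway between 1 and 4) and the NaN in column y by 4.5 (halfway between 3 and 6), instead of the column means.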

Filters with Pandas

Data visualization can help identify outliers, especially with box plots. Statistical information such as the standard deviation can also help, for example by eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement z['y']<5.5 creates a Boolean Series that is True where z['y']<5.5 and False where z['y']>=5.5 or where the value is NaN.

result = z[z['y']<5.5]

This filter eliminates the last row of the DataFrame, where y is 6.0, and the row with NaN in column y because any comparison with NaN evaluates to False.

        x    y
   0  1.0  2.0
   1  NaN  3.0
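
The standard deviation rule mentioned above can be sketched with the same kind of conditional filter. The data here is synthetic and purely an assumption for illustration: 200 random readings near 10 with one injected bad value. The names d and score are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption for illustration): 200 readings near 10
rng = np.random.default_rng(0)
d = pd.DataFrame({'y': rng.normal(10.0, 1.0, 200)})
d.loc[50, 'y'] = 100.0   # inject a single outlier

# Number of standard deviations each point lies from the column mean
score = (d['y'] - d['y'].mean()).abs() / d['y'].std()

# Keep only the rows within 5 standard deviations of the mean
result = d[score <= 5]
print(len(d), len(result))
```

The outlier itself inflates the mean and standard deviation, so with only a few samples no point can reach 5 standard deviations. A lower threshold or a robust statistic such as the median may be a better fit for small data sets.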

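Filters can be combined with the bitwise and operator & or the bitwise or operator |. Each condition needs its own parentheses because & and | have higher precedence than the comparison operators.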

result = z[(z['y']<5.5) & (z['y']>=1.0)]
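
As a short sketch of the combining operators on the example table, the snippet below applies and, or, and not to conditions on column y. The threshold values for the or case and the names both, either, and inverted are illustrative assumptions.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# and: both conditions must be True
both = z[(z['y'] < 5.5) & (z['y'] >= 1.0)]
print(both)

# or: at least one condition must be True
either = z[(z['y'] < 2.5) | (z['y'] > 5.5)]
print(either)

# not: inverts the condition, so the NaN row (False before) is now kept
inverted = z[~(z['y'] < 5.5)]
print(inverted)
```

The inverted filter shows why NaN values need attention: a row with NaN fails the original comparison and therefore passes its negation.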

Another way to combine filters is to chain successive methods on the object: z['x'].notnull() eliminates rows with NaN in the x column, then .fillna(0) replaces the remaining NaN in the y column with zero, and .reset_index(drop=True) resets the DataFrame index.

result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)

        x    y
   0  1.0  2.0
   1  4.0  0.0
   2  5.0  6.0

✅ Knowledge Check

1. What is the primary purpose of data cleansing?

A. To add bad data into machine learning models.

B. To identify and remove bad data.

C. To improve the efficiency of data storage.

D. To make data visualization more attractive.

2. How can NaN values be removed from a Numpy array?

A. By using the dropna() method.

B. By replacing NaN values with the mean of the array.

C. By using the command: z=z[~iz] where iz is the selection of NaN rows.

D. By using the fillna() method with the parameter 0.
