Data Cleansing

Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data.

Data cleansing is the process of removing bad data, which may include outliers, missing entries, readings from failed sensors, or other corrupted information.

Bad data can be detected with summary statistics and data visualization. An effective way to handle it is with conditional filters that remove outliers, replace bad values such as NaN (Not a Number), or remove each data row that contains a NaN value.
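As a minimal sketch of detection with summary statistics, the example below counts missing entries and prints basic statistics for each column. The table name data and its values are illustrative assumptions, not measurements from a real sensor.

```python
import numpy as np
import pandas as pd

# Small illustrative table (hypothetical values) with two missing entries
data = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Number of NaN values in each column
nan_counts = data.isna().sum()
print(nan_counts)

# Count, mean, std, min, quartiles, and max; NaN values are excluded
summary = data.describe()
print(summary)
```

A count that is lower than the number of rows, or a min or max far from the quartiles, is a hint that a column deserves a closer look.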


Remove Bad Data with Numpy

NaN values are removed with numpy by first identifying the rows iz that contain NaN. The rows are then removed with z=z[~iz], where ~ is the bitwise not operator, which acts as a logical not on a Boolean array.

import numpy as np
z = np.array([[      1,      2],
              [ np.nan,      3],
              [      4, np.nan],
              [      5,      6]])
# True for each row that contains at least one NaN
iz = np.any(np.isnan(z), axis=1)
print(~iz)
# Keep only the rows without NaN
z = z[~iz]
print(z)

The selection iz is True if NaN is found anywhere in the row or False if NaN is not found in that row. The ~ operator reverses True and False.

   print(~iz)

   > [ True False False  True]

Only the rows with True are kept in the final z result.

   print(z)

   > [[1. 2.]
   >  [5. 6.]]
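When rows are too valuable to discard, NaN values can be replaced instead of removed. The sketch below is one possible approach, assuming that the column mean is an acceptable substitute; the names col_mean and filled are introduced here for illustration.

```python
import numpy as np

z = np.array([[      1,      2],
              [ np.nan,      3],
              [      4, np.nan],
              [      5,      6]])

# Mean of each column, ignoring NaN values
col_mean = np.nanmean(z, axis=0)

# Where z is NaN take the column mean, otherwise keep the original value
filled = np.where(np.isnan(z), col_mean, z)
print(filled)
```

The one-dimensional col_mean is broadcast across the rows, so each NaN receives the mean of its own column.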

Remove Bad Data with Pandas

Pandas manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.

import numpy as np
import pandas as pd
z = pd.DataFrame({'x':[1,np.nan,4,5],'y':[2,3,np.nan,6]})
print(z)

This DataFrame contains the same values as the Numpy example.

        x    y
   0  1.0  2.0
   1  NaN  3.0
   2  4.0  NaN
   3  5.0  6.0

There are two common ways to deal with NaN values: drop the rows or fill in values. The first is to remove the rows with dropna. The call result=z.dropna() returns a new DataFrame and leaves z unchanged. Use z.dropna(inplace=True) to modify z directly and avoid the extra assignment.

result = z.dropna()
        x    y
   0  1.0  2.0
   3  5.0  6.0
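The following sketch shows two variations of dropna on the same table: restricting the check to selected columns with the subset argument, and modifying a DataFrame in place. The names only_x, z_copy, and returned are introduced here for illustration.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Drop a row only when column x is NaN; NaN in y is tolerated
only_x = z.dropna(subset=['x'])
print(only_x)

# In-place removal changes the DataFrame itself and returns None
z_copy = z.copy()
returned = z_copy.dropna(inplace=True)
print(returned)
print(z_copy)
```

Working on a copy keeps the original z available for comparison, which can be useful while deciding how aggressive the cleaning should be.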

Replace Bad Data with Pandas

If data is very limited, it may be better to keep the row and fill in values, either with interpolate for time-series data or with fillna to replace NaN with a default value.

result = z.fillna(z.mean())

A common replacement is the mean value for each column with z.mean(). In this case, the mean (average) of column x is 3.333 and column y is 3.667.

             x         y
   0  1.000000  2.000000
   1  3.333333  3.000000
   2  4.000000  3.666667
   3  5.000000  6.000000
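For ordered data such as a time series, interpolate can fill a gap from the neighboring values. This is a minimal sketch that assumes the rows are evenly spaced in time, which is what the default linear method presumes.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Linear interpolation between the nearest valid neighbors in each column
result = z.interpolate()
print(result)
```

Here the missing x value becomes 2.5 (halfway between 1 and 4) and the missing y value becomes 4.5 (halfway between 3 and 6).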

Filters with Pandas

Data visualization can help identify outliers, especially with box plots. Statistical information such as the standard deviation can also help, for example by eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement z['y']<5.5 creates a Boolean Series that is True where z['y']<5.5 and False where z['y']>=5.5 or where the value is NaN, because any comparison with NaN evaluates to False.

result = z[z['y']<5.5]

This filter eliminates the last row of the DataFrame, where y is 6.0, and the row with NaN in the y column. The NaN in the x column remains.

        x    y
   0  1.0  2.0
   1  NaN  3.0
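The standard deviation rule mentioned above can be sketched as a conditional filter. The data below is hypothetical: 60 ordinary readings plus one extreme value. The names data, zscore, and filtered are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 60 ordinary readings plus one obvious outlier
values = np.append(np.tile([9.0, 10.0, 11.0], 20), 1000.0)
data = pd.DataFrame({'y': values})

# Number of standard deviations each point lies from the column mean
zscore = (data['y'] - data['y'].mean()) / data['y'].std()

# Keep only the points within 5 standard deviations of the mean
filtered = data[zscore.abs() <= 5]
print(len(data), len(filtered))
```

An outlier also inflates the mean and standard deviation it is compared against, so in a small sample no point may ever exceed a threshold of 5. A lower threshold may be more practical for short data sets.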

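Filters can be combined with the bitwise and operator & or the bitwise or operator |. Each condition must be enclosed in parentheses because & and | take precedence over the comparison operators.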

result = z[(z['y']<5.5) & (z['y']>=1.0)]
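The or operator works the same way. As an illustrative sketch with arbitrarily chosen thresholds, the filter below keeps a row when either condition holds.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Keep rows where x is below 2 OR y is above 5
# A comparison with NaN is False, so NaN alone never satisfies a condition
either = z[(z['x'] < 2) | (z['y'] > 5)]
print(either)
```

Only rows 0 and 3 remain: row 0 satisfies the x condition and row 3 satisfies the y condition.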

Another way to combine filters is to chain successive methods: z['x'].notnull() eliminates rows with NaN in the x column, .fillna(0) replaces the remaining NaN in the y column with zero, and .reset_index(drop=True) resets the DataFrame index.

result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)
        x    y
   0  1.0  2.0
   1  4.0  0.0
   2  5.0  6.0


✅ Knowledge Check

1. What is the primary purpose of data cleansing?

A. To add bad data into machine learning models.
Incorrect. Data cleansing aims to remove bad data, not add it.
B. To identify and remove bad data.
Correct. Data cleansing is about removing bad data such as outliers, missing entries, and corrupted information.
C. To improve the efficiency of data storage.
Incorrect. While data cleansing can reduce storage needs by removing unnecessary data, its primary purpose is to improve data quality for analysis.
D. To make data visualization more attractive.
Incorrect. While clean data can lead to clearer visualizations, the main goal of data cleansing is to improve data quality for analysis.

2. How can NaN values be removed from a Numpy array?

A. By using the dropna() method.
Incorrect. The dropna() method is associated with Pandas, not Numpy.
B. By replacing NaN values with the mean of the array.
Incorrect. While this method is used to replace NaN values, it doesn't remove them.
C. By using the command: z=z[~iz] where iz is the selection of NaN rows.
Correct. This method uses the bitwise not operator (~) and a Boolean selection to remove rows containing NaN values from a Numpy array.
D. By using the fillna() method with the parameter 0.
Incorrect. The fillna() method is associated with Pandas, and it replaces NaN values rather than removing them.