Data Cleansing
Measurements from sensors or from human input can contain bad data that negatively affects machine learning. This tutorial demonstrates how to identify and remove bad data.
Data cleansing is the process of removing bad data such as outliers, missing entries, readings from failed sensors, or other types of missing or corrupted information.
Bad data can be detected with summary statistics and data visualization. Filters are an effective way to deal with it: a condition can remove outliers, replace bad values such as NaN (Not a Number), or remove any data row that contains a NaN value.
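As a rough illustration of detection with summary statistics, the sketch below counts missing entries and prints basic statistics with pandas. The DataFrame name data and its values are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing entry in each column
data = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Summary statistics (count, mean, std, min, max, quartiles) skip NaN values,
# so a count lower than the number of rows signals missing data
summary = data.describe()
print(summary)

# Count the missing entries in each column
missing = data.isna().sum()
print(missing)
```

A count that is lower than the number of rows, or a minimum or maximum far from the quartiles, is a hint that a column deserves a closer look.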
Remove Bad Data with Numpy
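NaN values are removed with numpy by first identifying the rows iz that contain NaN. Next, those rows are removed with z=z[~iz], where ~ is the bitwise not operator that inverts the boolean array.
import numpy as np
z = np.array([[1, 2],
              [np.nan, 3],
              [4, np.nan],
              [5, 6]])
# True for each row that contains at least one NaN
iz = np.any(np.isnan(z), axis=1)
print(~iz)
# keep only the rows without NaN
z = z[~iz]
print(z)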
The selection iz is True if NaN is found anywhere in the row or False if NaN is not found in that row. The ~ operator reverses True and False.
print(~iz)
[ True False False  True]
Only the rows with True are kept in the final z result.
print(z)
[[1. 2.]
 [5. 6.]]
Remove Bad Data with Pandas
Pandas manipulates data tables called DataFrames. There are many functions to efficiently manipulate data and remove bad data.
import numpy as np
import pandas as pd
z = pd.DataFrame({'x':[1,np.nan,4,5],'y':[2,3,np.nan,6]})
print(z)
This produces the same values as shown with the Numpy example.
     x    y
0  1.0  2.0
1  NaN  3.0
2  4.0  NaN
3  5.0  6.0
There are two common ways to deal with NaN values: drop the rows or fill in values. The first is to remove the rows with dropna. Because dropna returns a new DataFrame, either assign the result, as in result=z.dropna(), or use inplace=True to modify z directly.
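A minimal sketch of both options, assuming the same DataFrame z as in the example above:

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Option 1: dropna returns a new DataFrame and leaves z unchanged
result = z.dropna()

# Option 2: inplace=True modifies z directly, so no assignment is needed
z.dropna(inplace=True)
print(z)
```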
     x    y
0  1.0  2.0
3  5.0  6.0
Replace Bad Data with Pandas
If data is very limited then it may be better to keep the row and fill in values with methods such as interpolate for time-series data or fillna to replace NaN with a default value.
A common replacement is the mean value for each column with z.mean(). In this case, the mean (average) of column x is 3.333 and column y is 3.667.
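One way to apply the column means, sketched with the same DataFrame z; the name z_filled is used only for this illustration:

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# z.mean() returns one mean per column, computed without the NaN values;
# fillna then replaces each NaN with the mean of its own column
z_filled = z.fillna(z.mean())
print(z_filled)
```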
          x         y
0  1.000000  2.000000
1  3.333333  3.000000
2  4.000000  3.666667
3  5.000000  6.000000
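For time-series data, the interpolate method mentioned above can be sketched as follows. The temperature values are hypothetical, and the sketch assumes evenly spaced samples so that the default linear interpolation is appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical, evenly spaced temperature readings with two gaps
temperature = pd.Series([20.0, np.nan, 22.0, np.nan, 26.0])

# The default method is linear: each gap is filled from its neighbors
filled = temperature.interpolate()
print(filled)
```

Unlike a column mean, interpolation keeps the local trend of the signal, which is usually what matters for ordered measurements.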
Filters with Pandas
Data visualization can help identify outliers, especially with box plots. Statistical information such as the standard deviation can also help, for example by eliminating data points that are more than 5 standard deviations away from the mean. A conditional filter can be created to eliminate bad values. The conditional statement z['y']<5.5 creates a boolean Series that is True where z['y']<5.5 and False where z['y']>=5.5 or where the value is NaN, because any comparison with NaN is False.
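A short sketch of that conditional filter, assuming the same DataFrame z; the names keep and filtered are illustrative only:

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# Boolean Series: the comparison is False for y >= 5.5 and also for NaN
keep = z['y'] < 5.5
print(keep)

# Indexing with the boolean Series keeps only the rows marked True
filtered = z[keep]
print(filtered)
```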
This filter eliminates the last row of the DataFrame, where y is 6.0, and the row with NaN in the y column.
     x    y
0  1.0  2.0
1  NaN  3.0
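The standard deviation rule mentioned at the start of this section can be sketched in the same way. The readings below are hypothetical (60 values near 10 and one spurious spike of 100), and the threshold of 5 standard deviations is the one named in the text; a different threshold may suit other data.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: 60 values near 10 plus one spurious spike
values = np.append(np.tile([9.0, 10.0, 11.0], 20), 100.0)
readings = pd.DataFrame({'y': values})

mean = readings['y'].mean()
std = readings['y'].std()

# Keep only rows within 5 standard deviations of the mean
within = (readings['y'] - mean).abs() <= 5 * std
cleaned = readings[within]
print(len(readings), len(cleaned))
```

The outlier itself inflates both the mean and the standard deviation, so this rule only catches extreme values when there are enough normal points in the data set.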
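Filters can be combined with the bitwise and operator & or the bitwise or operator |. Each condition must be wrapped in parentheses because & and | take precedence over comparison operators.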
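A minimal sketch of combined filters on the same DataFrame z. The threshold values and the names both and either are chosen only for illustration.

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# and: keep rows where both conditions are True
both = z[(z['x'] > 0.5) & (z['y'] < 5.5)]
print(both)

# or: keep rows where at least one condition is True
either = z[(z['x'] > 4.5) | (z['y'] < 2.5)]
print(either)
```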
Another way to combine filters is to operate on the object with successive methods, such as z['x'].notnull() to keep only the rows without NaN in the x column, then .fillna(0) to replace the remaining NaN in the y column with zero, and .reset_index(drop=True) to reset the DataFrame index.
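One possible way to chain these steps, assuming the same DataFrame z; the name result is illustrative:

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'x': [1, np.nan, 4, 5], 'y': [2, 3, np.nan, 6]})

# 1. keep rows where x is not NaN
# 2. replace any remaining NaN (here only in y) with zero
# 3. renumber the index from 0 and discard the old index
result = z[z['x'].notnull()].fillna(0).reset_index(drop=True)
print(result)
```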
     x    y
0  1.0  2.0
1  4.0  0.0
2  5.0  6.0
Further Reading
- Brownlee, J., How to Remove Outliers for Machine Learning, April 2018.
✅ Knowledge Check
1. What is the primary purpose of data cleansing?
- Incorrect. Data cleansing aims to remove bad data, not add it.
- Correct. Data cleansing is about removing bad data such as outliers, missing entries, and corrupted information.
- Incorrect. While data cleansing can reduce storage needs by removing unnecessary data, its primary purpose is to improve data quality for analysis.
- Incorrect. While clean data can lead to clearer visualizations, the main goal of data cleansing is to improve data quality for analysis.
2. How can NaN values be removed from a Numpy array?
- Incorrect. The dropna() method is associated with Pandas, not Numpy.
- Incorrect. While this method is used to replace NaN values, it doesn't remove them.
- Correct. This method uses the bitwise not operator (~) with a boolean selection to remove rows containing NaN values from a Numpy array.
- Incorrect. The fillna() method is associated with Pandas, and it replaces NaN values rather than removing them.