Text Data Analysis

Importing and analyzing text data is common in data-driven engineering. Text files are often in the form of tables with a header row that describes each data column. Each row contains values as either numbers or strings that are separated by tabs, commas, or another type of delimiter. This tutorial demonstrates text data analysis with an application to vehicle performance.

Import Libraries

Text files can be imported with the Python open() command, numpy.loadtxt(), or pandas.read_csv(). The pandas library builds on numpy and is a fully featured library for importing and analyzing data. This tutorial demonstrates the use of pandas for automotive data import and analysis.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
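The three import options can be compared on a small in-memory CSV. This sketch uses a hypothetical two-column table, not the vehicle data, to show the different result types each option returns:

```python
import io
import numpy as np
import pandas as pd

csv_text = 'time,speed\n0,0.0\n1,12.5\n2,25.0\n'

# open()-style reading returns raw text; parsing is manual
lines = io.StringIO(csv_text).read().splitlines()

# numpy.loadtxt() returns a numeric array; header skipped explicitly
arr = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1)

# pandas.read_csv() returns a labeled DataFrame; header parsed automatically
df = pd.read_csv(io.StringIO(csv_text))
```

pandas is preferred here because the header row becomes column labels that later steps (filtering, indexing, plotting) can reference by name.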

Import Comma Separated Value (CSV) File

A 2021 Ford Explorer was driven and data recorded from the OBD-II port. The engine data was written to a CSV file and uploaded to the cloud for retrieval and analysis.

The data file is stored in a zipped archive to compress the file for more efficient data storage and transfer. Pandas downloads and unzips the data file from the URL. The first two rows contain descriptive information about the data collection and are skipped with skiprows=2.

file = 'http://apmonitor.com/dde/uploads/Main/Auto_Ford2021.zip'
data = pd.read_csv(file,skiprows=2)

Summarize Data

Summary statistics provide insight, especially for large datasets. The count, mean, standard deviation, minimum, quartile, and maximum values are provided for each data column.

data.describe()
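The summary statistics are easy to verify by hand on a small synthetic Series (illustrative values, not from the vehicle data):

```python
import pandas as pd

df = pd.DataFrame({'speed': [10.0, 20.0, 30.0, 40.0]})
stats = df.describe()
# count=4, mean=25, min=10, 50% (median)=25, max=40
```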

Display First 10 Rows

The first rows show that some of the sensors require 2-20 seconds to initialize. This creates data columns of 0 values or -1 values. For example, Bearing (deg) is 0-360° but is -1 initially until the sensor value is available. The data.head() function reports the first 5 rows by default. Other useful functions are data.tail() and data.sample() to report the end or randomly selected rows, respectively.

data.head(10)
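The head(), tail(), and sample() functions can be compared on a synthetic 100-row DataFrame (a sketch, not the vehicle data):

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

first = df.head(10)                   # rows 0-9
last = df.tail(3)                     # rows 97-99
rand = df.sample(5, random_state=0)   # 5 reproducible random rows
```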

Data Cleansing

It is important to remove bad data before analysis. Many machine learning methods are heavily influenced by outliers or bad data and can lead to wrong conclusions. The first step is to remove the leading space in the column names.

# trim column names (extra leading space)
c = list(data.columns)
print('Before: ',c[0:3])
for i,ci in enumerate(c):
    c[i] = ci.strip()
print('After: ',c[0:3])
data.columns = c
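The loop above can also be written with the vectorized string accessor on the column Index, which trims every name in one call:

```python
import pandas as pd

# hypothetical columns with leading spaces, mimicking the vehicle data file
df = pd.DataFrame({' Time (sec)': [0, 1], ' Vehicle speed (MPH)': [0.0, 5.0]})
df.columns = df.columns.str.strip()   # trim whitespace from every column name
```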

Conditional filters keep only those data rows that are True. The following filter keeps only the rows with data['Bearing (deg)']>=0.

# filter data
# remove rows with Bearing (deg)=-1 (sensors are initializing)
data = data[data['Bearing (deg)']>=0]
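Multiple conditions can be combined with & (and) or | (or), with each condition in its own parentheses. A sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'Bearing (deg)': [-1, -1, 10, 200, 350],
                   'Vehicle speed (MPH)': [0, 0, 30, 55, 5]})

# keep rows where the bearing sensor is valid AND the vehicle is moving
moving = df[(df['Bearing (deg)'] >= 0) & (df['Vehicle speed (MPH)'] > 0)]
```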

In addition to removing the first rows while the sensors initialize, it may be desirable to remove the last rows if the vehicle is stopped.

# remove data (remove last 5 rows)
data = data.iloc[:-5]
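Because iloc selects rows by integer position, the same slice syntax trims rows from either end. An illustrative check:

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

trimmed = df.iloc[2:-3]   # drop the first 2 and last 3 rows
```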

Data Reduction

Data reduction is important during data exploration phases for quick analysis and to resize the data set if only a subset is needed. The syntax data[::10] returns every 10th row from the original DataFrame.

# keep every 10th row
data = data[::10]
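The stride syntax keeps one of every N rows; data.iloc[::10] is equivalent. A quick check on 100 synthetic rows:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})
reduced = df[::10]   # rows 0, 10, 20, ..., 90
```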

Set Time as Index

The table may have a key column that can be used to slice and reference the table by a unique row identifier. In this case, the Time (sec) is used as the DataFrame index.

# set time index
data.set_index('Time (sec)',inplace=True)
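With a time index, rows can then be selected by time value with .loc instead of by integer position. A small illustrative sketch:

```python
import pandas as pd

df = pd.DataFrame({'Time (sec)': [0, 1, 2, 3, 4],
                   'Vehicle speed (MPH)': [0, 5, 15, 30, 45]})
df.set_index('Time (sec)', inplace=True)

# slice rows with 1 <= Time (sec) <= 3 (label slicing is inclusive)
window = df.loc[1:3]
```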

Add Column

Additional Series (columns) can be added to the DataFrame such as with custom calculations that are derived from other columns. In this case, the Average fuel economy (MPG) is calculated.

data['Avg fuel economy (MPG)'] = data['Trip Distance (miles)']/data['Trip Fuel (gal)']
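Early rows may have zero trip fuel, which makes the ratio infinite. One way to guard the division (a sketch, not part of the original tutorial) is to replace inf with NaN so plots and statistics ignore those rows:

```python
import numpy as np
import pandas as pd

# hypothetical trip data; the first row has zero fuel consumed
df = pd.DataFrame({'Trip Distance (miles)': [0.0, 1.0, 2.0],
                   'Trip Fuel (gal)': [0.0, 0.05, 0.08]})

df['Avg fuel economy (MPG)'] = (df['Trip Distance (miles)']
                                / df['Trip Fuel (gal)']).replace(np.inf, np.nan)
```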

Visualize Select Data

The data.plot() function creates a time series plot of the data. Adding subplots=True creates a separate plot for each trend. Only a subset of the data columns is included in the plot.

c = ['Vehicle speed (MPH)','Fuel rate (gal/hr)',
     'Absolute throttle position (%)',
     'Acceleration (ft/s²)','Altitude (ft)',
     'Avg fuel economy (MPG)']
data[c].plot(figsize=(10,7),subplots=True)
plt.tight_layout()
plt.show()

View Data Correlation

The Seaborn package creates custom visualizations from Pandas DataFrames. The sns.pairplot() is a scatter plot of each pair of data series in a matrix form. The plots on the diagonal show the distribution of that single variable and the off-diagonal plots show the correlation between data series.

import seaborn as sns
sns.pairplot(data[c])
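The pairwise correlation coefficients behind the off-diagonal scatter plots are also available numerically from the corr() method. A small illustrative check with perfectly correlated columns:

```python
import pandas as pd

df = pd.DataFrame({'speed': [10.0, 20.0, 30.0],
                   'fuel': [1.0, 2.0, 3.0]})
corr = df.corr()   # Pearson correlation matrix
```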

Display GPS Route

Plotly is a package for interactive data visualization. If it is not already installed, install plotly with pip and restart the kernel.

#pip install plotly

A scatter map is created to show the vehicle route, with color as the vehicle speed (MPH) and dot size as the fuel rate (gal/hr).

import plotly.express as px
fig = px.scatter_mapbox(data, lat="Latitude (deg)",
                        lon="Longitude (deg)",
                        color="Vehicle speed (MPH)",
                        size="Fuel rate (gal/hr)",
                        color_continuous_scale=px.colors.cyclical.IceFire,
                        size_max=5, zoom=7)
fig.update_layout(
    mapbox_style="open-street-map",
    margin={"r": 0, "t": 0, "l": 0, "b": 0},
)
fig.show()

Export Modified Text File

The modified DataFrame is written to a new CSV file with data.to_csv(). The Time (sec) index is included as the first column.

data.to_csv('export_data.csv')
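The exported file can be read back with read_csv; passing index_col restores the time index. A round-trip sketch through an in-memory buffer (hypothetical data, not the vehicle file):

```python
import io
import pandas as pd

df = pd.DataFrame({'Time (sec)': [0, 1],
                   'x': [1.5, 2.5]}).set_index('Time (sec)')

# round-trip through CSV text; index_col restores the time index
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
back = pd.read_csv(buf, index_col='Time (sec)')
```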

Activity

A 2021 Chrysler Pacifica is driven in Iowa. Compare the Ford Explorer and Chrysler Pacifica with the following:

  • Calculate average fuel economy for both vehicles
  • Include both vehicles on a pairplot
  • Create a map of the Chrysler Pacifica route
file = 'http://apmonitor.com/dde/uploads/Main/Auto_Chrysler2021.zip'
dch = pd.read_csv(file,skiprows=2)


✅ Knowledge Check

1. To import text data from a CSV file, which library and function is recommended?

A. Python open() command
B. numpy.loadtxt()
C. pandas.read_csv()
D. matplotlib.pyplot

2. After importing data, what is the primary reason to perform data cleansing?

A. To add new columns to the data
B. To change the visual representation of the data
C. To remove outliers or bad data
D. To reduce the size of the dataset for faster processing