Pandas Time-Series

Time-series data is generated sequentially with repeated measurements that typically occur at regular intervals. The pandas package collects and analyzes time-series data in a DataFrame.

Pandas reads stored files or data tables from a database. It produces a DataFrame as a table with headers and each column as a Series. The data source can either be local

# local source
pd.read_csv('data.csv')

[$[Get Code]]

or from an online source such as a URL.

# online source
pd.read_csv('http://apmonitor.com/dde/uploads/Main/tclab.txt')

[$[Get Code]]

There are additional read options in Pandas such as from the Clipboard, Excel, Feather, Flat file, Google BigQuery, HTML, JSON, HDFStore, Pickled Files, Parquet, ORC, SAS, SPSS, SQL Database, and STATA. There are also options to write a DataFrame to a CSV file or to the following data structures or online repositories.

# write DataFrame dx
dx.to_clipboard()
dx.to_csv()
dx.to_dict()
dx.to_excel()
dx.to_feather()
dx.to_gbq()
dx.to_hdf()
dx.to_html()
dx.to_json()
dx.to_latex()
dx.to_markdown()
dx.to_parquet()
dx.to_pickle()
dx.to_records()
dx.to_sql()
dx.to_stata()
dx.to_string()

[$[Get Code]]

Join Data

For time series data, the tables are joined to match features and labels at particular time points.

Generate DataFrames dx and dy

import numpy as np
import pandas as pd

tx = np.linspace(0,1,4); x = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})

ty = np.linspace(0,1,3); y = np.sin(ty)
dy = pd.DataFrame({'Time':ty,'y':y})

[$[Get Code]]

          Time         x
   0  0.000000  1.000000
   1  0.333333  0.944957
   2  0.666667  0.785887
   3  1.000000  0.540302

          Time         y
   0       0.0  0.000000
   1       0.5  0.479426
   2       1.0  0.841471

The two dataframes are joined with an index ('Time') that is set as the key for comparing and merging.

dx = dx.set_index('Time')
dy = dy.set_index('Time')

[$[Get Code]]

Options for merging are to join only the entries of dy that match dx.

z = dx.join(dy)

[$[Get Code]]

All missing entries have NaN for Not a Number. The default is a left join meaning that DataFrame z has the same index as x.

                    x         y
   Time                        
   0.000000  1.000000  0.000000
   0.333333  0.944957       NaN
   0.666667  0.785887       NaN
   1.000000  0.540302  0.841471

Right Join

A right join results in the index of y.

z = dx.join(dy,how='right')

[$[Get Code]]

                x         y
   Time                    
   0.0   1.000000  0.000000
   0.5        NaN  0.479426
   1.0   0.540302  0.841471

Inner Join

z = dx.join(dy,how='inner')

[$[Get Code]]

An inner join gives only the rows with an index that is common to both DataFrames.

                x         y
   Time                    
   0.0   1.000000  0.000000
   1.0   0.540302  0.841471

Outer Join

z = dx.join(dy,how='outer',sort=True)

[$[Get Code]]

An outer join with sort=True gives all of the entries from both DataFrames.

                    x         y
   Time                        
   0.000000  1.000000  0.000000
   0.333333  0.944957       NaN
   0.500000       NaN  0.479426
   0.666667  0.785887       NaN
   1.000000  0.540302  0.841471

Append Data

Multiple data sets can be combined with the .append() function. This is needed to combine multiple data sets with the same types of data but stored separately.

import numpy as np
import pandas as pd
tx = np.linspace(0,1,4); x = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})
tx = np.linspace(0,1,3)
x = np.cos(tx)
dy = pd.DataFrame({'Time':tx,'x':x})

[$[Get Code]]

In this case, both dx and dy have the same data but at different times.

          Time         x
   0  0.000000  1.000000
   1  0.333333  0.944957
   2  0.666667  0.785887
   3  1.000000  0.540302

          Time         x
       0   0.0  1.000000
       1   0.5  0.877583
       2   1.0  0.540302

The functions append, sort_values, drop_duplicates, and reset_index are used to append, sort, consolidate, and reset the index of the table.

dx.append(dy)\
.sort_values(by='Time')\
.drop_duplicates(subset='Time')\
.reset_index(drop=True)

[$[Get Code]]

          Time         x
   0  0.000000  1.000000
   1  0.333333  0.944957
   2  0.500000  0.877583
   3  0.666667  0.785887
   4  1.000000  0.540302

Activity

Time-Series Analysis with Pandas

Google Colab

Complete two exercises with the Temperature Control Lab TCLab or TCLab Model (digital twin) TCLabModel. Collect data on time, heater 1 (Q1), heater 2 (Q2), temperature 1 (T1), and temperature 2 (T2).

1. Import stored data as a Pandas DataFrame.

TCLab Data

2. Collect streaming data: add a data row to a Pandas DataFrame each second for 60 seconds. Change heater 1 (Q1) from 0 to 70% output at 5 seconds.

For each, generate a plot that shows the measured temperature and heater values on separate subplots. Add appropriate labels to the plots such as x-label and legend. Compute basic statistics such as Minimum, Maximum, Average, and Standard Deviation values. Show a table that summarizes the statistics.

Solution

import tclab
import time
import pandas as pd
import matplotlib.pyplot as plt

x = {'time':[],'Q1':[],'Q2':[],'T1':[],'T2':[]}
df = pd.DataFrame(x)

with tclab.TCLabModel() as lab:
for i in range(61):
Q1 = 0 if i<=5 else 70
lab.Q1(Q1); Q2=0
df.loc[i] = [i,Q1,Q2,lab.T1,lab.T2]
time.sleep(1)

df.set_index('time',inplace=True)

print(df.describe())
df.to_csv('tclab.csv')

df.plot(subplots=True)
plt.show()

[$[Get Code]]

import numpy as np
import tclab
import time
import pandas as pd

speedup = 10
TCLab = tclab.setup(connected=False, speedup=speedup)

x = {'time':[],'Q1':[],'Q2':[],'T1':[],'T2':[]}
df = pd.DataFrame(x)

with TCLab() as lab:
for i in range(61):
start = time.time()
Q1 = 0 if i<=5 else 70
lab.Q1(Q1); Q2=0
df.loc[i] = [i,Q1,Q2,lab.T1,lab.T2]

end = time.time()
# sleep for 0.1 sec
time.sleep(max(0.01,1/speedup-(end-start)))

df.set_index('time',inplace=True)

print(df.describe())
df.to_csv('tclab.csv')

df.plot(subplots=True)
plt.show()

[$[Get Code]]

✅ Knowledge Check

1. Which statement best describes the purpose of the pd.read_csv() function in pandas?

A. It writes the data in a DataFrame to a CSV file.

Incorrect. The pd.read_csv() function is used to read data from a CSV file into a pandas DataFrame, not write data to a CSV file.

B. It reads data from a CSV file into a pandas DataFrame.

Correct. The pd.read_csv() function is indeed used to read data from a CSV file into a pandas DataFrame.

C. It merges two pandas DataFrames based on a common key.

Incorrect. Merging of DataFrames is done using methods like join() and not pd.read_csv().

D. It sorts the rows in a DataFrame based on a specified column.

Incorrect. Sorting of rows in a DataFrame based on a specified column is done using the sort_values() method, not pd.read_csv().

2. After joining two dataframes dx and dy with an outer join and the sort=True argument, which of the following is true about the resultant dataframe?

A. Only the entries common to both dataframes will be present.

Incorrect. An outer join includes all entries from both DataFrames, not just the common ones.

B. The resultant dataframe will have an index from dx.

Incorrect. An outer join with sort=True gives all of the entries from both DataFrames and sorts them.

C. The resultant dataframe will contain all entries from both dx and dy.

Correct. An outer join includes all entries from both dx and dy.

D. NaN values will be replaced by zeros.

Incorrect. The join operation does not replace NaN values with zeros. NaN values indicate missing entries.

Data-Driven Engineering