Pandas Time-Series
Time-series data is generated sequentially with repeated measurements that typically occur at regular intervals. The pandas package collects and analyzes time-series data in a DataFrame.
Pandas reads stored files or data tables from a database. It produces a DataFrame as a table with headers and each column as a Series. The data source can either be local
pd.read_csv('data.csv')
or from an online source such as a URL.
pd.read_csv('http://apmonitor.com/dde/uploads/Main/tclab.txt')
There are additional read options in Pandas such as from the Clipboard, Excel, Feather, Flat file, Google BigQuery, HTML, JSON, HDFStore, Pickled Files, Parquet, ORC, SAS, SPSS, SQL Database, and STATA. There are also options to write a DataFrame to a CSV file or to the following data structures or online repositories.
dx.to_clipboard()
dx.to_csv()
dx.to_dict()
dx.to_excel()
dx.to_feather()
dx.to_gbq()
dx.to_hdf()
dx.to_html()
dx.to_json()
dx.to_latex()
dx.to_markdown()
dx.to_parquet()
dx.to_pickle()
dx.to_records()
dx.to_sql()
dx.to_stata()
dx.to_string()
Join Data
For time series data, the tables are joined to match features and labels at particular time points.
Generate DataFrames dx and dy
import pandas as pd
tx = np.linspace(0,1,4); x = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})
ty = np.linspace(0,1,3); y = np.sin(ty)
dy = pd.DataFrame({'Time':ty,'y':y})
Time x 0 0.000000 1.000000 1 0.333333 0.944957 2 0.666667 0.785887 3 1.000000 0.540302 Time y 0 0.0 0.000000 1 0.5 0.479426 2 1.0 0.841471
The two dataframes are joined with an index ('Time') that is set as the key for comparing and merging.
dy = dy.set_index('Time')
Options for merging are to join only the entries of dy that match dx.
All missing entries have NaN for Not a Number. The default is a left join meaning that DataFrame z has the same index as x.
x y Time 0.000000 1.000000 0.000000 0.333333 0.944957 NaN 0.666667 0.785887 NaN 1.000000 0.540302 0.841471
Right Join
A right join results in the index of y.
x y Time 0.0 1.000000 0.000000 0.5 NaN 0.479426 1.0 0.540302 0.841471
Inner Join
An inner join gives only the rows with an index that is common to both DataFrames.
x y Time 0.0 1.000000 0.000000 1.0 0.540302 0.841471
Outer Join
An outer join with sort=True gives all of the entries from both DataFrames.
x y Time 0.000000 1.000000 0.000000 0.333333 0.944957 NaN 0.500000 NaN 0.479426 0.666667 0.785887 NaN 1.000000 0.540302 0.841471
Append Data
Multiple data sets can be combined with the .append() function. This is needed to combine multiple data sets with the same types of data but stored separately.
import pandas as pd
tx = np.linspace(0,1,4); x = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})
tx = np.linspace(0,1,3)
x = np.cos(tx)
dy = pd.DataFrame({'Time':tx,'x':x})
In this case, both dx and dy have the same data but at different times.
Time x 0 0.000000 1.000000 1 0.333333 0.944957 2 0.666667 0.785887 3 1.000000 0.540302 Time x 0 0.0 1.000000 1 0.5 0.877583 2 1.0 0.540302
The functions append, sort_values, drop_duplicates, and reset_index are used to append, sort, consolidate, and reset the index of the table.
.sort_values(by='Time')\
.drop_duplicates(subset='Time')\
.reset_index(drop=True)
Time x 0 0.000000 1.000000 1 0.333333 0.944957 2 0.500000 0.877583 3 0.666667 0.785887 4 1.000000 0.540302
Activity
Complete two exercises with the Temperature Control Lab TCLab or TCLab Model (digital twin) TCLabModel. Collect data on time, heater 1 (Q1), heater 2 (Q2), temperature 1 (T1), and temperature 2 (T2).
1. Import stored data as a Pandas DataFrame.
2. Collect streaming data: add a data row to a Pandas DataFrame each second for 60 seconds. Change heater 1 (Q1) from 0 to 70% output at 5 seconds.
For each, generate a plot that shows the measured temperature and heater values on separate subplots. Add appropriate labels to the plots such as x-label and legend. Compute basic statistics such as Minimum, Maximum, Average, and Standard Deviation values. Show a table that summarizes the statistics.
Solution
import time
import pandas as pd
import matplotlib.pyplot as plt
x = {'time':[],'Q1':[],'Q2':[],'T1':[],'T2':[]}
df = pd.DataFrame(x)
with tclab.TCLabModel() as lab:
for i in range(61):
Q1 = 0 if i<=5 else 70
lab.Q1(Q1); Q2=0
df.loc[i] = [i,Q1,Q2,lab.T1,lab.T2]
time.sleep(1)
df.set_index('time',inplace=True)
print(df.describe())
df.to_csv('tclab.csv')
df.plot(subplots=True)
plt.show()
import tclab
import time
import pandas as pd
speedup = 10
TCLab = tclab.setup(connected=False, speedup=speedup)
x = {'time':[],'Q1':[],'Q2':[],'T1':[],'T2':[]}
df = pd.DataFrame(x)
with TCLab() as lab:
for i in range(61):
start = time.time()
Q1 = 0 if i<=5 else 70
lab.Q1(Q1); Q2=0
df.loc[i] = [i,Q1,Q2,lab.T1,lab.T2]
end = time.time()
# sleep for 0.1 sec
time.sleep(max(0.01,1/speedup-(end-start)))
df.set_index('time',inplace=True)
print(df.describe())
df.to_csv('tclab.csv')
df.plot(subplots=True)
plt.show()
✅ Knowledge Check
1. Which statement best describes the purpose of the pd.read_csv() function in pandas?
- Incorrect. The pd.read_csv() function is used to read data from a CSV file into a pandas DataFrame, not write data to a CSV file.
- Correct. The pd.read_csv() function is indeed used to read data from a CSV file into a pandas DataFrame.
- Incorrect. Merging of DataFrames is done using methods like join() and not pd.read_csv().
- Incorrect. Sorting of rows in a DataFrame based on a specified column is done using the sort_values() method, not pd.read_csv().
2. After joining two dataframes dx and dy with an outer join and the sort=True argument, which of the following is true about the resultant dataframe?
- Incorrect. An outer join includes all entries from both DataFrames, not just the common ones.
- Incorrect. An outer join with sort=True gives all of the entries from both DataFrames and sorts them.
- Correct. An outer join includes all entries from both dx and dy.
- Incorrect. The join operation does not replace NaN values with zeros. NaN values indicate missing entries.