Data Visualization and Exploration

Data visualization and exploration is one of the first steps in machine learning after the data is gathered and statistically summarized. It is used to graphically represent data to qualitatively understand relationships and data quality.

Visualization is the graphical representation of data, while exploration builds an understanding of the data to make better-informed decisions about how it can be applied. Together, visualization and exploration give an understanding of data diversity, relationships, missing data, bad data, and other factors that may influence the decision to include or exclude a subset of the data for training.

We will explore the Python packages that are commonly used for data visualization and exploration including pandas, plotly, seaborn, and matplotlib.

You may need to install Python packages from the terminal, Anaconda prompt, command prompt, or from the Jupyter Notebook.

pip install pandas matplotlib plotly seaborn
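Before installing, it can help to check which of the packages are already available. The snippet below is a minimal sketch that uses the standard library importlib module. The helper name missing_packages is an illustrative choice and is not part of any of the packages.

```python
import importlib.util

# Packages used in this section
packages = ['pandas', 'matplotlib', 'plotly', 'seaborn']

def missing_packages(names):
    """Return the names that the import system cannot find."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(packages)
if missing:
    # Suggest the install command for only the missing packages
    print('Install with: pip install ' + ' '.join(missing))
else:
    print('All packages are installed')
```

The check only looks for the package and does not import it, so it runs quickly even for large packages.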

Jupyter Notebooks automatically add a scroll bar when the cell output is longer than a threshold. Increase this limit by running the following JavaScript code in a Jupyter Notebook code cell.

%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999

This introductory exercise uses simulated data from the Temperature Control Lab (TCLab).

Warning: This script requires 5 hours to run because it is collecting 100 steady state points from the TCLab that require 3 minutes each. Use the generated data at https://apmonitor.com/pds/uploads/Main/TCLab_ss1.txt to avoid this wait time.

import numpy as np
import tclab
import time

filename='TCLab_ss1.txt'
# create the data file and write the header row
with open(filename,'w') as fid:
    fid.write('Q1,Q2,T1,T2\n')

# Connect to the TCLab simulator (use tclab.TCLab() for the Arduino hardware)
a = tclab.TCLabModel()

# random heater values (0-100%)
Q1d = np.random.rand(100)*100
Q2d = np.random.rand(100)*100

# collect 100 steady state points (~3 minutes each)
print('Wait 180 seconds between heater points')
print('Full data generation requires 5 hrs!')
for i in range(100):
    # set heater values
    a.Q1(Q1d[i])
    a.Q2(Q2d[i])
    # wait 180 seconds
    time.sleep(180)
    print('Set: ' + str(i) + \
          ' Q1: ' + str(Q1d[i]) + \
          ' Q2: ' + str(Q2d[i]) + \
          ' T1: ' + str(a.T1)   + \
          ' T2: ' + str(a.T2))
    # append the steady state point to the data file
    with open(filename,'a') as fid:
        fid.write(str(Q1d[i])+','+str(Q2d[i])+','
                  +str(a.T1)+','+str(a.T2)+'\n')
# close connection to the TCLab
a.close()

The final activity gives you an opportunity to try the exploration and visualization approaches with simulated photovoltaic (PV) data from NREL PVWatts.


Pandas

Pandas visualizes and manipulates data tables. It has many functions for efficient manipulation of tables in the preliminary steps of a data analysis problem. Run the code below to read the TCLab data file into a DataFrame named data. The data.head() command shows the top rows of the table.

import pandas as pd
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/TCLab_ss1.txt')
print(data.head())

This prints the first 5 rows of the table. You can also see the last rows with data.tail() or change the number of rows with data.head(10).

             Q1         Q2     T1     T2
   0  46.458549  65.722695  34.01  31.21
   1  52.920782  31.783877  40.30  31.11
   2  30.273413  14.655389  36.85  29.34
   3  97.817672  50.730076  49.23  31.28
   4  94.648879  91.025338  50.48  35.14
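The commands can be tried without downloading the file. The sketch below enters the five rows shown above by hand (an assumption made only so that the example is self-contained) and prints the first rows, the last rows, the table size, and the column types.

```python
import pandas as pd

# First five rows of the TCLab table shown above, entered by hand
data = pd.DataFrame({
    'Q1': [46.458549, 52.920782, 30.273413, 97.817672, 94.648879],
    'Q2': [65.722695, 31.783877, 14.655389, 50.730076, 91.025338],
    'T1': [34.01, 40.30, 36.85, 49.23, 50.48],
    'T2': [31.21, 31.11, 29.34, 31.28, 35.14],
})

first3 = data.head(3)   # first 3 rows
last2 = data.tail(2)    # last 2 rows
print(first3)
print(last2)
print(data.shape)       # (number of rows, number of columns)
print(data.dtypes)      # data type of each column
```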

The data.describe() command shows summary statistics.

print(data.describe())

The output lists the count, mean, standard deviation, minimum, quartiles, and maximum of each column.

                  Q1          Q2          T1          T2
   count  120.000000  120.000000  120.000000  120.000000
   mean    47.889806   50.990620   39.455833   33.456500
   std     26.697499   29.894677    6.911408    3.598129
   min      3.103358    0.333330   26.860000   24.510000
   25%     24.454193   28.370305   34.185000   31.020000
   50%     43.829722   52.226625   37.400000   33.545000
   75%     73.543570   79.391947   44.842500   35.845000
   max     98.912891   99.991342   53.870000   42.650000
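Summary statistics are also a quick way to spot missing data, because the count of a column drops when values are missing. The sketch below uses the first five rows of the TCLab table entered by hand and removes one T2 value on purpose. The gap is hypothetical and is only there to show the effect.

```python
import numpy as np
import pandas as pd

# First five rows of the TCLab table, entered by hand
data = pd.DataFrame({
    'Q1': [46.458549, 52.920782, 30.273413, 97.817672, 94.648879],
    'Q2': [65.722695, 31.783877, 14.655389, 50.730076, 91.025338],
    'T1': [34.01, 40.30, 36.85, 49.23, 50.48],
    'T2': [31.21, 31.11, 29.34, 31.28, 35.14],
})

# Hypothetical gap: mark one T2 value as missing for illustration
data.loc[2, 'T2'] = np.nan

stats = data.describe()        # count excludes missing values
n_missing = data.isna().sum()  # number of missing values per column
clean = data.dropna()          # drop rows with any missing value

print(stats)
print(n_missing)
print(len(clean), 'rows remain after dropping missing data')
```

Dropping rows is only one option. Whether to drop, fill, or keep rows with missing values depends on how the data is used for training.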

The data.plot() command creates a line plot with all of the data columns.

data.plot()

The optional parameter kind sets the type of plot to produce.

  • line : line plot (default)
  • bar : vertical bar plot
  • barh : horizontal bar plot
  • hist : histogram
  • box : boxplot
  • kde / density : Kernel Density Estimation plot
  • area : area plot
  • pie : pie plot
  • scatter : scatter plot
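  • hexbin : hexbin plot

The scatter and hexbin plots also require the x and y column names. The option subplots=True places each column in a separate plot. The density plot requires the scipy package.

data.plot(kind='density',subplots=True,layout=(2,2),figsize=(10,6))
data.plot(kind='box', subplots=True, figsize=(12,3))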
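As a further sketch of the kind parameter, the code below creates a scatter plot of T1 against Q1 and a grid of histograms. It uses the first five rows of the TCLab table entered by hand, and the bins value and the output file names are arbitrary choices for the example.

```python
import pandas as pd

# First five rows of the TCLab table, entered by hand
data = pd.DataFrame({
    'Q1': [46.458549, 52.920782, 30.273413, 97.817672, 94.648879],
    'Q2': [65.722695, 31.783877, 14.655389, 50.730076, 91.025338],
    'T1': [34.01, 40.30, 36.85, 49.23, 50.48],
    'T2': [31.21, 31.11, 29.34, 31.28, 35.14],
})

# scatter needs the x and y column names
ax = data.plot(kind='scatter', x='Q1', y='T1')
ax.figure.savefig('tclab_pandas_scatter.png')

# one histogram per column in a 2x2 grid
axes = data.plot(kind='hist', subplots=True, layout=(2, 2),
                 figsize=(10, 6), bins=5)
axes[0, 0].figure.savefig('tclab_pandas_hist.png')
```

The plot call returns matplotlib axes, so the figures can be adjusted or saved with the usual matplotlib commands. In a Jupyter Notebook the figures also display below the cell.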

Matplotlib

The matplotlib package generates plots in Python. Run the code below to create a basic scatter plot.

import matplotlib.pyplot as plt
plt.scatter(data['Q1'],data['T1'],label='T1')

# add labels and legend
plt.xlabel('Heater (%)')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()
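Matplotlib also has an object-oriented interface where the figure and axes are created first and then modified. The sketch below plots both heater and temperature pairs on one set of axes with a title and legend. It uses the first five rows of the TCLab table entered by hand, and the file name tclab_scatter.png is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the TCLab table, entered by hand
data = pd.DataFrame({
    'Q1': [46.458549, 52.920782, 30.273413, 97.817672, 94.648879],
    'Q2': [65.722695, 31.783877, 14.655389, 50.730076, 91.025338],
    'T1': [34.01, 40.30, 36.85, 49.23, 50.48],
    'T2': [31.21, 31.11, 29.34, 31.28, 35.14],
})

fig, ax = plt.subplots(figsize=(6, 4))
# the label of each series is shown in the legend
ax.scatter(data['Q1'], data['T1'], label='T1 vs Q1')
ax.scatter(data['Q2'], data['T2'], marker='x', label='T2 vs Q2')
ax.set_xlabel('Heater (%)')
ax.set_ylabel('Temperature (°C)')
ax.set_title('TCLab steady state data')
ax.legend()

# save to a file (use plt.show() to open a window instead)
fig.savefig('tclab_scatter.png', dpi=150)
```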

Plotly

Packages such as plotly and bokeh render interactive plots with HTML and JavaScript. Plotly Express is a high-level and easy-to-use interface to Plotly.

import plotly.express as px
fig = px.scatter(data, x="Q1", y="T1")
fig.show()

Seaborn

Seaborn is built on matplotlib and produces detailed plots in a few lines of code. Run the code below to see an example with the TCLab data.

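import seaborn as sns
sns.pairplot(data)

A heat map of the correlation matrix from data.corr() summarizes the strength of the linear relationship between each pair of columns. Run it in a separate cell so that it is drawn on a new figure.

import seaborn as sns
sns.heatmap(data.corr())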
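The colors of a heat map are easier to read when the correlation values are printed in each cell. The sketch below uses the first five rows of the TCLab table entered by hand, so the correlation values are not the same as for the full data set. The color map and file name are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# First five rows of the TCLab table, entered by hand
data = pd.DataFrame({
    'Q1': [46.458549, 52.920782, 30.273413, 97.817672, 94.648879],
    'Q2': [65.722695, 31.783877, 14.655389, 50.730076, 91.025338],
    'T1': [34.01, 40.30, 36.85, 49.23, 50.48],
    'T2': [31.21, 31.11, 29.34, 31.28, 35.14],
})

# Pearson correlation between every pair of columns
corr = data.corr()
print(corr.round(2))

# annotate each cell and fix the color scale to the range -1 to 1
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm', ax=ax)
fig.savefig('tclab_corr.png')
```

Fixing the color scale to the range from -1 to 1 keeps the colors comparable between different data sets.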

Activity

Explore data from PVWatts for BYU South Campus. PVWatts is a calculator from NREL that estimates the energy production and cost of energy of grid-connected photovoltaic (PV) energy systems. Investigate only the data columns listed as factors below, which are a subset of the available data columns.

import pandas as pd
df = pd.read_csv('http://apmonitor.com/pds/uploads/Main/PV_BYU_South.txt')
factors=['Ambient Temperature (C)',
         'Wind Speed (m/s)',
         'Plane of Array Irradiance (W/m^2)',
         'Cell Temperature (C)',
         'DC Array Output (W)']

Answer the following questions:

  • What factors are highly correlated with DC Array Output (W)?
  • What factors are highly correlated with Cell Temperature (C)?
  • PV cells are more efficient at lower temperatures. Does the data also show this effect? Why or why not?

One way to explore the data and answer these questions is shown below.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

df = pd.read_csv('http://apmonitor.com/pds/uploads/Main/PV_BYU_South.txt')
factors=['Ambient Temperature (C)',
         'Wind Speed (m/s)',
         'Plane of Array Irradiance (W/m^2)',
         'Cell Temperature (C)',
         'DC Array Output (W)']
print(df.columns)
data = df[factors].copy() # take only subset of data columns
# remove rows where there is no sunlight
data = data[data['Plane of Array Irradiance (W/m^2)']>0.01]
# calculate an efficiency measure as output per unit irradiance
# (divide by the PV cell area in m^2 to get the true efficiency)
data['efficiency'] = data['DC Array Output (W)'] \
                      /data['Plane of Array Irradiance (W/m^2)']
print(data.head())
print(data.describe())
data.plot()

fig = px.scatter(data, x="Ambient Temperature (C)",
                       y="Cell Temperature (C)")
fig.show()
sns.pairplot(data)

plt.figure()
x = data['Ambient Temperature (C)']
y = data['Cell Temperature (C)']
plt.scatter(x,y)
plt.xlabel('Ambient Temperature (°C)')
plt.ylabel('Cell Temperature (°C)')
plt.show()
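The first two questions can also be checked numerically by ranking the correlation of each factor with a target column. The sketch below defines a hypothetical helper rank_correlations and demonstrates it on synthetic data with the placeholder columns x1, x2, and y. The synthetic values are generated only for the demonstration and are not PVWatts data.

```python
import numpy as np
import pandas as pd

def rank_correlations(frame, target):
    """Correlation of each other column with target, largest magnitude first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic placeholder data: y depends on x1 but not on x2
rng = np.random.default_rng(0)
x1 = rng.uniform(50, 1000, size=200)
x2 = rng.uniform(0, 35, size=200)
demo = pd.DataFrame({'x1': x1,
                     'x2': x2,
                     'y': 3.0*x1 + rng.normal(0, 50, size=200)})

ranked = rank_correlations(demo, 'y')
print(ranked)
```

With the PV data loaded as in the code above, the same helper could be called as rank_correlations(data, 'DC Array Output (W)') or rank_correlations(data, 'Cell Temperature (C)'). A correlation coefficient only measures linear relationships, so the scatter plots remain useful for spotting nonlinear effects.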


✅ Knowledge Check

1. What Python package is commonly used for data visualization and exploration and is built on top of matplotlib?

A. seaborn
Correct. Seaborn is a Python package used for data visualization and is built on top of matplotlib.
B. pandas
Incorrect. Pandas is used for data manipulation and calls matplotlib for its plotting functions, but the package itself is built on NumPy rather than matplotlib.
C. numpy
Incorrect. NumPy is primarily used for numerical operations and is not a data visualization package.
D. plotly
Incorrect. Plotly is a separate library used for creating interactive plots and is not built on matplotlib.

2. Which command is used to print the top rows of a DataFrame in pandas?

A. data.start()
Incorrect. There is no command named 'data.start()' in pandas.
B. data.show()
Incorrect. The correct command in pandas to show the top rows of a DataFrame is 'data.head()'.
C. data.head()
Correct. The 'data.head()' command is used in pandas to display the top rows of a DataFrame.
D. data.display()
Incorrect. There is no command named 'data.display()' in pandas.