Data Gathering and Consolidation

Gathering data is the process of consolidating disparate data sources (Excel spreadsheets, CSV files, PDF reports, databases, cloud storage) into a single repository. Data may reside in local storage, be obtained through a remote connection to a database or web service, or be extracted from a webpage.

Generate a data table dx and save it as a Comma Separated Values (CSV) file to be used as an example.

Generate Sample Data dx.csv

import numpy as np
import pandas as pd
tx = np.linspace(0,1,8); x  = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})
dx.to_csv('dx.csv',index=False)
print(dx)
          Time         x
   0  0.000000  1.000000
   1  0.142857  0.989813
   2  0.285714  0.959461
   3  0.428571  0.909560
   4  0.571429  0.841129
   5  0.714286  0.755561
   6  0.857143  0.654600
   7  1.000000  0.540302

Pandas

An easy way to read a data file is with Pandas. It produces a DataFrame, a table with column headers. The data source can either be local

# local source
pd.read_csv('dx.csv')

or from an online source such as a URL.

# online source
pd.read_csv('http://apmonitor.com/pds/uploads/Main/dx.txt')

Pandas can also read tables from a webpage such as the data table from this page.

url = 'http://apmonitor.com/pds/index.php/Main/GatherData'
table = pd.read_html(url)
print(table)
   [          0         1
    0      Time         x
    1  0.000000  1.000000
    2  0.142857  0.989813
    3  0.285714  0.959461
    4  0.428571  0.909560
    5  0.571429  0.841129
    6  0.714286  0.755561
    7  0.857143  0.654600
    8  1.000000  0.540302]
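
A single DataFrame is selected from the returned list by index:

# read_html returns a list; select the first table
d0 = table[0]
print(d0.head())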

If there is more than one table on a page, Pandas reads each one into a list of DataFrames. There are additional read options in Pandas such as Clipboard, Excel, Feather, Flat file, Google BigQuery, HTML, JSON, HDFStore, Pickled Files, Parquet, ORC, SAS, SPSS, SQL Database, and STATA. There are also options to write a DataFrame to a CSV file or to the following data structures and online repositories.

# write DataFrame dx
dx.to_clipboard()
dx.to_csv()
dx.to_dict()
dx.to_excel()
dx.to_feather()
dx.to_gbq()
dx.to_hdf()
dx.to_html()
dx.to_json()
dx.to_latex()
dx.to_markdown()
dx.to_parquet()
dx.to_pickle()
dx.to_records()
dx.to_sql()
dx.to_stata()
dx.to_string()
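
As a minimal sketch of one of these round trips, the example DataFrame is written to JSON and read back:

import pandas as pd

# round trip: CSV -> DataFrame -> JSON -> DataFrame
dx = pd.read_csv('dx.csv')
dx.to_json('dx.json')        # stores 10 digits of precision by default
dj = pd.read_json('dx.json')
print(dj)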

Numpy

There are Numpy functions for loading data files. The function genfromtxt replaces any non-numeric values (such as a header) with nan (not a number).

d = np.genfromtxt('dx.csv',delimiter=',')
print(d)
   [[       nan        nan]
    [0.         1.        ]
    [0.14285714 0.98981326]
    [0.28571429 0.95946058]
    [0.42857143 0.90956035]
    [0.57142857 0.84112921]
    [0.71428571 0.75556135]
    [0.85714286 0.65460007]
    [1.         0.54030231]]

The function loadtxt skips a specified number of rows (skiprows) that may contain non-numeric values such as a header.

d = np.loadtxt('dx.csv',delimiter=',',skiprows=1)
print(d)
   [[0.         1.        ]
    [0.14285714 0.98981326]
    [0.28571429 0.95946058]
    [0.42857143 0.90956035]
    [0.57142857 0.84112921]
    [0.71428571 0.75556135]
    [0.85714286 0.65460007]
    [1.         0.54030231]]

Numpy functions are fast but limited to numeric data that fits into a standard Numpy data structure such as an array.
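
When a file mixes headers or text with numbers, one option is to read it with Pandas and convert the numeric columns to a Numpy array; a minimal sketch:

import pandas as pd

# read with Pandas (handles the header), then convert to a Numpy array
dx = pd.read_csv('dx.csv')
d = dx.to_numpy()
print(d.shape)   # (8, 2)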


Open and Read File

Sometimes more control is needed to process complex data files. The open() function returns a file object f. This script opens the file dx.csv, prints each line, and closes the file.

f = open('dx.csv', 'r')
for x in f:
    print(x, end='')   # each line already includes a newline
f.close()
   Time,x
   0.0,1.0
   0.14285714285714285,0.9898132604466151
   0.2857142857142857,0.9594605811119173
   0.42857142857142855,0.9095603516741667
   0.5714285714285714,0.8411292134152362
   0.7142857142857142,0.7555613467006966
   0.8571428571428571,0.6546000666752676
   1.0,0.5403023058681398

The read() function returns the contents of the entire file. Passing a number such as read(4) returns only the first 4 characters.

f = open('dx.csv', 'r')
x = f.read(4); print(x)
f.close()
   Time

The readline() function returns the next line of the file.

f = open('dx.csv', 'r')
print(f.readline())
print(f.readline())
print(f.readline())
f.close()
   Time,x
   0.0,1.0
   0.14285714285714285,0.9898132604466151
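
A with statement is an idiomatic alternative that closes the file automatically, even if an error occurs:

# the file is closed automatically when the block exits
with open('dx.csv', 'r') as f:
    print(f.readline().strip())   # prints: Time,x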

Beautiful Soup

Beautiful Soup is a Python package for extracting (scraping) information from web pages. It uses an HTML or XML parser and functions for iterating, searching, and modifying the parse tree.

First, get the HTML source from a webpage such as this page.

import requests
url = 'http://apmonitor.com/pds/index.php/Main/GatherData?action=print'
page = requests.get(url)

The attribute page.content contains the HTML source if page.status_code starts with a 2, such as 200 (downloaded successfully). A status code starting with 4 or 5 indicates an error. BeautifulSoup parses HTML or XML files.

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
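
In practice, the status code is checked before parsing; a minimal sketch that reuses page from above:

# stop on 4xx/5xx responses instead of parsing an error page
page.raise_for_status()
print(page.status_code)   # 200 indicates success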

The function soup.prettify() displays the structured output, and soup.title gives the page title:

print(soup.title.text)
  Data Gathering and Consolidation

All of the links are extracted:

for link in soup.find_all('a'):
    print('Link Text: {}'.format(link.text))
    print('href: {}'.format(link.get('href')))
   Link Text: Data Gathering and Consolidation
   href: https://apmonitor.com/pds/index.php/Main/GatherData
   Link Text: Data File
   href: https://apmonitor.com/pds/uploads/Main/dx.txt

Pandas uses BeautifulSoup to extract tables from webpages. Data scraping is particularly useful for getting information from webpages that are updated with new information such as weather, stock data, and customer reviews. More advanced web scraping packages are Scrapy and Selenium.
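
As a minimal sketch of combining the two, BeautifulSoup locates a specific table element and Pandas parses it (StringIO wraps the HTML string because newer versions of Pandas expect a file-like object for literal HTML):

import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = 'http://apmonitor.com/pds/index.php/Main/GatherData'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# find the first table element on the page (assumes at least one exists)
tbl = soup.find('table')
df = pd.read_html(StringIO(str(tbl)))[0]
print(df)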

Join Data

For time series data, the tables are joined to match features and labels at particular time points.

Generate DataFrames dx and dy

import numpy as np
import pandas as pd

tx = np.linspace(0,1,4); x  = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})

ty = np.linspace(0,1,3); y  = np.sin(ty)
dy = pd.DataFrame({'Time':ty,'y':y})
          Time         x
   0  0.000000  1.000000
   1  0.333333  0.944957
   2  0.666667  0.785887
   3  1.000000  0.540302

          Time         y
   0       0.0  0.000000
   1       0.5  0.479426
   2       1.0  0.841471

The two DataFrames are joined on an index ('Time') that is set as the key for comparing and merging.

dx = dx.set_index('Time')
dy = dy.set_index('Time')

One option for merging is to join only the entries of dy that match the index of dx.

z = dx.join(dy)

Missing entries are filled with NaN (Not a Number). The default is a left join, meaning that DataFrame z has the same index as dx.

                    x         y
   Time                        
   0.000000  1.000000  0.000000
   0.333333  0.944957       NaN
   0.666667  0.785887       NaN
   1.000000  0.540302  0.841471

Right Join

A right join keeps the index of dy.

z = dx.join(dy,how='right')
                x         y
   Time                    
   0.0   1.000000  0.000000
   0.5        NaN  0.479426
   1.0   0.540302  0.841471

Inner Join

z = dx.join(dy,how='inner')

An inner join gives only the rows with an index that is common to both DataFrames.

                x         y
   Time                    
   0.0   1.000000  0.000000
   1.0   0.540302  0.841471

Outer Join

z = dx.join(dy,how='outer',sort=True)

An outer join with sort=True gives all of the entries from both DataFrames.

                    x         y
   Time                        
   0.000000  1.000000  0.000000
   0.333333  0.944957       NaN
   0.500000       NaN  0.479426
   0.666667  0.785887       NaN
   1.000000  0.540302  0.841471
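
The same joins are available with pd.merge, which joins on columns rather than the index; a minimal sketch of the outer join:

# merge on the Time column directly instead of the index
z = pd.merge(dx.reset_index(), dy.reset_index(),
             on='Time', how='outer').sort_values('Time')
print(z)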

Append Data

Multiple data sets can be combined with the pd.concat() function (the older DataFrame .append() method is deprecated and removed in recent versions of Pandas). This is needed to combine multiple data sets with the same types of data that are stored separately.

import numpy as np
import pandas as pd
tx = np.linspace(0,1,4); x  = np.cos(tx)
dx = pd.DataFrame({'Time':tx,'x':x})
tx = np.linspace(0,1,3)
x  = np.cos(tx)
dy = pd.DataFrame({'Time':tx,'x':x})

In this case, both dx and dy have the same data but at different times.

          Time         x
   0  0.000000  1.000000
   1  0.333333  0.944957
   2  0.666667  0.785887
   3  1.000000  0.540302

      Time         x
   0   0.0  1.000000
   1   0.5  0.877583
   2   1.0  0.540302

The functions concat, sort_values, drop_duplicates, and reset_index combine, sort, deduplicate, and reset the index of the table.

pd.concat([dx,dy])\
  .sort_values(by='Time')\
  .drop_duplicates(subset='Time')\
  .reset_index(drop=True)
          Time         x
   0  0.000000  1.000000
   1  0.333333  0.944957
   2  0.500000  0.877583
   3  0.666667  0.785887
   4  1.000000  0.540302

Activity

Import Google Stock Data and display the table.

Calculate summary statistics such as the mean, median, standard deviation, and quartile information. Create two new data columns for Volatility=High-Low and Change=Close-Open.

Solution

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# URL for Google (GOOG) stock data
url = 'http://apmonitor.com/pds/uploads/Main/goog.txt'

# import data with pandas
data = pd.read_csv(url)
print(data.describe())

# calculate change and volatility
data['Change'] = data['Close']-data['Open']
data['Volatility'] = data['High']-data['Low']
analysis = ['Open','Volume','Volatility','Change']
sns.heatmap(data[analysis].corr())
plt.show()

✅ Knowledge Assessment

1. Which of the following best describes the process of data gathering?

A. Storing data
Incorrect. Data gathering involves actively collecting data, not just storing it.
B. Deleting redundant data
Incorrect. While it's essential to manage redundant data, data gathering itself focuses on the collection and consolidation of data.
C. Actively collecting data from various sources
Correct. Data gathering involves actively collecting data from different sources for analysis or decision-making.
D. Analyzing existing data
Incorrect. Analysis is a subsequent step after data gathering. Data gathering focuses on the collection process.

2. What does data consolidation involve?

A. Making multiple copies of the same data
Incorrect. Data consolidation is about bringing data together, not creating multiple copies of it.
B. Bringing together data from different sources into a single view
Correct. Data consolidation means combining data from various sources to present it in a unified manner or view.
C. Only collecting data without any intention of using it
Incorrect. Data consolidation implies the intention to use combined data, not merely collecting without a purpose.
D. Analyzing and interpreting data patterns
Incorrect. While analysis may follow consolidation, the consolidation itself is about combining data.