Summary Statistics

Summary statistics are one of the first steps in data engineering after the data is gathered. They help to assess data quality and diversity. Exploring data with statistics is a common first activity, and many excellent packages help with the standard analysis.

We will explore Python packages that are commonly used for statistical analysis and data exploration, including numpy and pandas.

You may need to install Python packages from the terminal, Anaconda prompt, command prompt, or from the Jupyter Notebook.

pip install numpy pandas

Pandas

Pandas imports data, generates summary statistics, and manipulates data tables. Its many functions allow efficient manipulation in the preliminary steps of a data analysis problem. Run the code below to read the smallpox data file into a DataFrame named data. The data.head() command shows the top rows of the table.

import pandas as pd
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/smallpox.txt')
data.head()

This prints the top 5 rows of the table. You can also see the last rows with data.tail() or change the number of rows with data.head(10).

        week state  state_name   disease  cases  incidence_per_capita
   0  192801    AL     ALABAMA  SMALLPOX      1                  0.04
   1  192801    AR    ARKANSAS  SMALLPOX      7                  0.38
   2  192801    AZ     ARIZONA  SMALLPOX      0                  0.00
   3  192801    CA  CALIFORNIA  SMALLPOX     18                  0.34
   4  192801    CO    COLORADO  SMALLPOX     31                  3.06
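
A quick way to get oriented in a new table is to check its shape and column types along with the first and last rows. The sketch below uses a small hand-typed DataFrame that copies the five rows shown above, so it runs without downloading the file. The variable names first_two and last_two are chosen only for illustration.

```python
import pandas as pd

# Hand-typed copy of the five rows printed above (stand-in for the full file)
data = pd.DataFrame({
    "week": [192801, 192801, 192801, 192801, 192801],
    "state": ["AL", "AR", "AZ", "CA", "CO"],
    "state_name": ["ALABAMA", "ARKANSAS", "ARIZONA", "CALIFORNIA", "COLORADO"],
    "disease": ["SMALLPOX"] * 5,
    "cases": [1, 7, 0, 18, 31],
    "incidence_per_capita": [0.04, 0.38, 0.00, 0.34, 3.06],
})

first_two = data.head(2)   # first 2 rows
last_two = data.tail(2)    # last 2 rows

print(data.shape)          # (number of rows, number of columns)
print(data.dtypes)         # data type of each column
print(last_two)
```

With the full data set, the same calls apply to the DataFrame returned by pd.read_csv.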

The data.describe() command shows summary statistics.

data.describe()

This produces basic summary statistics for the numeric columns.

                   week         cases  incidence_per_capita
   count   50916.000000  50916.000000          50916.000000
   mean   193809.850636      4.572787              0.249108
   std       591.489888     15.062277              0.824331
   min    192801.000000      0.000000              0.000000
   25%    193312.000000      0.000000              0.000000
   50%    193819.000000      0.000000              0.000000
   75%    194324.000000      2.000000              0.090000
   max    195250.000000    350.000000             50.360000
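
The individual values reported by describe() can also be computed one at a time with pandas or numpy. The sketch below uses the five case counts from the first rows of the table as a small stand-in sample. One detail worth noting is that pandas computes the sample standard deviation (ddof=1) by default, while numpy defaults to the population form (ddof=0).

```python
import numpy as np
import pandas as pd

# Case counts from the first five rows of the table (small stand-in sample)
cases = pd.Series([1, 7, 0, 18, 31], name="cases")

# pandas versions of the describe() entries
mean_pd = cases.mean()
median_pd = cases.median()          # same as the 50% row
std_pd = cases.std()                # sample standard deviation (ddof=1)
q75_pd = cases.quantile(0.75)       # same as the 75% row

# numpy equivalents on the underlying array
values = cases.to_numpy()
mean_np = np.mean(values)
median_np = np.median(values)
std_np_sample = np.std(values, ddof=1)  # matches the pandas default
std_np_pop = np.std(values)             # numpy default is ddof=0
q75_np = np.percentile(values, 75)

print(mean_pd, median_pd, std_pd, q75_pd)
print(mean_np, median_np, std_np_sample, std_np_pop, q75_np)
```

The two libraries agree once ddof=1 is passed to np.std; without it, the numpy result is slightly smaller.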

Pandas Profiling

Pandas Profiling (ydata-profiling) is a data analysis tool that gives a more in-depth summary of the data than the describe() function. Install the package with:

  pip install ipywidgets ydata-profiling

You need to restart the kernel before proceeding. The install only needs to run once.

from ydata_profiling import ProfileReport
profile = ProfileReport(data, explorative=True, minimal=False)

The code above imports ProfileReport and creates a new profile to analyze the data. Some of the analysis takes a long time with a large data set. Two methods for dealing with large data sets are to:

  • Sub-sample the data set, such as with data = data[::10] to take every 10th row.
  • Use the minimal=True option to avoid analysis that is slow with large data sets.
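
The sub-sampling option can be sketched with pandas alone. The example below builds a synthetic 1000-row table (random values, not the smallpox data) and compares the every-10th-row slice with a random 10% sample from DataFrame.sample. The random sample is an alternative not mentioned above, which may be preferable when the rows are ordered in a way that a regular stride would bias.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in table with 1000 rows (random values, not real data)
rng = np.random.default_rng(0)
data = pd.DataFrame({"cases": rng.integers(0, 50, size=1000)})

# Every 10th row: index 0, 10, 20, ...
every_10th = data[::10]

# Random 10% of the rows; random_state makes the draw repeatable
random_10pct = data.sample(frac=0.1, random_state=1)

print(len(data), len(every_10th), len(random_10pct))
```

Either reduced DataFrame can then be passed to ProfileReport in place of the full table.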

View the profile report in the Jupyter Notebook with profile.to_widgets() or export it to an HTML file with profile.to_file("analysis.html").

profile.to_file("analysis.html")

Activity

This activity uses summary statistics to analyze disease spread in the US states with smallpox data. The introductory exercise analyzes data specific to the state of Utah.

Summary statistics are then created for a US state other than Utah. Basic mathematical operations are used to create trends that describe the disease spread.
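
One possible starting point for the activity is sketched below. It filters the table to a single state, summarizes the case counts, and builds two simple trends: a running total and a total per year. The rows are made-up illustrative values, not records from the smallpox file, and the year calculation assumes the week column encodes year and week as YYYYWW, as the values 192801 through 195250 suggest.

```python
import pandas as pd

# Made-up illustrative rows (NOT real records) with the same column names
data = pd.DataFrame({
    "week": [192801, 192802, 192901, 192801, 192802, 192901],
    "state": ["UT", "UT", "UT", "CO", "CO", "CO"],
    "cases": [3, 5, 2, 31, 20, 10],
})

# Select one state; .copy() avoids modifying a view of the original table
utah = data[data["state"] == "UT"].copy()

# Assumption: week is encoded as YYYYWW, so integer division by 100 gives the year
utah["year"] = utah["week"] // 100

# Simple trends: running total of cases and total cases per year
utah["cumulative_cases"] = utah["cases"].cumsum()
yearly = utah.groupby("year")["cases"].sum()

print(utah["cases"].describe())
print(yearly)
```

Changing "UT" to another state abbreviation repeats the same analysis for a different state.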

✅ Knowledge Check

1. What is one of the primary uses of summary statistics in data engineering?

A. To rewrite and recreate data from scratch
B. To assess data quality and diversity after gathering
C. To modify the structure of the database
D. Only for graphical representation of data

2. Which Python library mentioned is primarily used for generating summary statistics and manipulating data tables?

A. numpy
B. Anaconda
C. pandas
D. Jupyter Notebook