Summary Statistics

Summary statistics give valuable insights as one of the first steps in data engineering, right after the data is gathered. Measures such as the count, mean, standard deviation, and quartiles help to assess data quality and diversity. Data discovery with statistics is a common first activity, and many excellent packages support the standard analysis.

We will explore the Python packages that are commonly used for statistical analysis and data exploration, including numpy and pandas.

You may need to install Python packages from the terminal, Anaconda prompt, command prompt, or from the Jupyter Notebook.

pip install numpy pandas

Pandas

Pandas imports data, generates summary statistics, and manipulates data tables. Many of its functions make the preliminary steps of a data analysis problem efficient. Run the code below to read the smallpox data file into a DataFrame named data. The data.head() command shows the top rows of the table.

import pandas as pd
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/smallpox.txt')
data.head()

This prints the first 5 rows of the table. You can also see the last rows with data.tail() or change the number of rows with data.head(10).

        week state  state_name   disease  cases  incidence_per_capita
   0  192801    AL     ALABAMA  SMALLPOX      1                  0.04
   1  192801    AR    ARKANSAS  SMALLPOX      7                  0.38
   2  192801    AZ     ARIZONA  SMALLPOX      0                  0.00
   3  192801    CA  CALIFORNIA  SMALLPOX     18                  0.34
   4  192801    CO    COLORADO  SMALLPOX     31                  3.06
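
The same inspection commands can be tried without downloading the file. The sketch below rebuilds the five rows printed above as a small DataFrame (the name sample is only for illustration) and applies head(), tail(), shape, and dtypes to it.

```python
import pandas as pd

# Five rows copied from the data.head() output above; the full table
# would normally come from pd.read_csv(...)
sample = pd.DataFrame({
    "week": [192801, 192801, 192801, 192801, 192801],
    "state": ["AL", "AR", "AZ", "CA", "CO"],
    "state_name": ["ALABAMA", "ARKANSAS", "ARIZONA", "CALIFORNIA", "COLORADO"],
    "disease": ["SMALLPOX"] * 5,
    "cases": [1, 7, 0, 18, 31],
    "incidence_per_capita": [0.04, 0.38, 0.00, 0.34, 3.06],
})

first_two = sample.head(2)  # first 2 rows instead of the default 5
last_two = sample.tail(2)   # last 2 rows

print(first_two)
print(last_two)
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type of each column
```

Checking shape and dtypes early is a quick way to confirm that the file was parsed as expected before computing any statistics.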

The data.describe() command shows summary statistics.

data.describe()

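This produces basic summary statistics for each numeric column: the count, mean, standard deviation (std), minimum, quartiles (25%, 50%, 75%), and maximum. The week column is stored as an integer code, so pandas summarizes it as a number even though its mean and standard deviation have little meaning.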

                   week         cases  incidence_per_capita
   count   50916.000000  50916.000000          50916.000000
   mean   193809.850636      4.572787              0.249108
   std       591.489888     15.062277              0.824331
   min    192801.000000      0.000000              0.000000
   25%    193312.000000      0.000000              0.000000
   50%    193819.000000      0.000000              0.000000
   75%    194324.000000      2.000000              0.090000
   max    195250.000000    350.000000             50.360000
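
Individual statistics can also be computed one at a time, either with pandas methods or with numpy functions. The sketch below uses only the five cases values from the head() output, so the numbers differ from the full-table results above. One detail worth knowing: pandas std() uses the sample formula (ddof=1) by default, while numpy std() uses the population formula (ddof=0) unless told otherwise.

```python
import numpy as np
import pandas as pd

# The 'cases' values from the first five rows shown by data.head()
cases = pd.Series([1, 7, 0, 18, 31], name="cases")

mean_pd = cases.mean()      # arithmetic mean
median_pd = cases.median()  # middle value of the sorted data
std_pd = cases.std()        # pandas default: sample std (ddof=1)

values = cases.to_numpy()
std_np = np.std(values)                 # numpy default: population std (ddof=0)
std_np_sample = np.std(values, ddof=1)  # matches the pandas default

print(mean_pd, median_pd)
print(std_pd, std_np, std_np_sample)
```

If pandas and numpy report slightly different standard deviations for the same column, the ddof setting is the first thing to check.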

Pandas Profiling

Pandas Profiling (now distributed as ydata-profiling) is a data analysis tool that gives a more in-depth summary of the data than the describe() function. Install the package with:

  pip install ipywidgets ydata-profiling

You need to restart the kernel before proceeding. The install only needs to run once.

from ydata_profiling import ProfileReport
profile = ProfileReport(data, explorative=True, minimal=False)

Load ProfileReport and create a new profile to analyze the data. Some of the analyses take a long time with a large data set. Two methods for dealing with large data sets are to:

  • Sub-sample the data set, such as with data = data[::10] to take every 10th row.
  • Use the minimal=True option to skip analyses that are slow with large data sets.
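
As a minimal sketch of sub-sampling, the code below builds a hypothetical 100-row table (a stand-in, not the smallpox file) and reduces it two ways: the slice data[::10] from the list above, and a random 10% draw with DataFrame.sample(), which is an alternative not mentioned in the list.

```python
import pandas as pd

# Hypothetical stand-in table with 100 rows (not the smallpox data)
data = pd.DataFrame({"cases": range(100)})

# Every 10th row: keeps rows 0, 10, 20, ..., 90
every_10th = data[::10]

# Alternative: a random 10% of the rows; random_state makes the draw repeatable
random_10pct = data.sample(frac=0.1, random_state=0)

print(len(data), len(every_10th), len(random_10pct))
```

Slicing keeps evenly spaced rows, which preserves time order but can line up with periodic patterns in the data. A random sample avoids that alignment at the cost of an irregular spacing.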

View the profile report in the Jupyter Notebook with profile.to_widgets() or export it to an HTML file with profile.to_file("analysis.html").

profile.to_file("analysis.html")

Activity

This activity uses summary statistics to analyze disease spread in the US states with smallpox data. The introductory exercise analyzes data specific to the state of Utah.

Summary statistics are then created for a US state other than Utah. Basic mathematical operations are used to create trends that describe the disease spread.
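
One possible way to approach this kind of analysis is sketched below. It is not the activity solution. The helper yearly_cases is a hypothetical name, the demo rows are made up for illustration and are not values from the smallpox file, and the code assumes that week is encoded as a year followed by a two-digit week number (for example 192801), so integer division by 100 recovers the year.

```python
import pandas as pd

def yearly_cases(data, state):
    """Total cases per year for one state, given its two-letter code."""
    subset = data[data["state"] == state].copy()
    # Assumption: week looks like YYYYWW, so week // 100 is the year
    subset["year"] = subset["week"] // 100
    return subset.groupby("year")["cases"].sum()

# Made-up rows for illustration only (not taken from the smallpox file)
demo = pd.DataFrame({
    "week": [192801, 192802, 192901, 192902, 192801],
    "state": ["CO", "CO", "CO", "CO", "AL"],
    "cases": [3, 5, 2, 1, 4],
})

trend = yearly_cases(demo, "CO")
print(trend)

# Summary statistics for the selected state only
print(demo[demo["state"] == "CO"]["cases"].describe())
```

With the full table loaded as data, the same helper could be called as yearly_cases(data, "UT") or with any other state code.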

✅ Knowledge Check

1. What is one of the primary uses of summary statistics in data engineering?

A. To rewrite and recreate data from scratch
Incorrect. Summary statistics are not used for rewriting or recreating data.
B. To assess data quality and diversity after gathering
Correct. Summary statistics provide valuable insights as one of the initial steps in data engineering, helping to assess data quality and diversity.
C. To modify the structure of the database
Incorrect. While data manipulation might involve modifying structures, summary statistics are about understanding the data's nature and distribution.
D. Only for graphical representation of data
Incorrect. While graphical representation might utilize summary statistics, they have broader applications.

2. Which of the Python libraries mentioned is primarily used for generating summary statistics and manipulating data tables?

A. numpy
Incorrect. While numpy is useful for numerical operations in Python, it's not the primary tool for generating summary statistics on data tables.
B. Anaconda
Incorrect. Anaconda is a distribution of Python and R for scientific computing and data science, not a specific library for summary statistics.
C. pandas
Correct. Pandas is widely used for data manipulation and generation of summary statistics, especially with its DataFrame functionality.
D. Jupyter Notebook
Incorrect. Jupyter Notebook is an interactive computing environment, not a specific library for generating summary statistics.