Summary Statistics

Summary statistics are one of the first steps in data engineering after the data is gathered. They help to assess data quality and diversity. Data discovery with statistics is a common first activity, and many excellent packages support the standard analysis.

We will explore the Python packages that are commonly used for statistical analysis and data exploration, including numpy and pandas.

Summary Statistics Notebook

Jupyter Notebook in Google Colab

You may need to install Python packages from the terminal, Anaconda prompt, command prompt, or from the Jupyter Notebook.

pip install numpy pandas

Pandas

Pandas imports data, generates summary statistics, and manipulates data tables. Its functions handle the preliminary steps of a data analysis problem efficiently. Run the code below to read the smallpox data file into a DataFrame named data. The data.head() command shows the top rows of the table.

import pandas as pd
data = pd.read_csv('http://apmonitor.com/pds/uploads/Main/smallpox.txt')
data.head()

This prints the first 5 rows of the table. You can also see the last rows with data.tail() or change the number of rows with data.head(10).

        week state  state_name   disease  cases  incidence_per_capita
   0  192801    AL     ALABAMA  SMALLPOX      1                  0.04
   1  192801    AR    ARKANSAS  SMALLPOX      7                  0.38
   2  192801    AZ     ARIZONA  SMALLPOX      0                  0.00
   3  192801    CA  CALIFORNIA  SMALLPOX     18                  0.34
   4  192801    CO    COLORADO  SMALLPOX     31                  3.06
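
As a minimal sketch of head() and tail() that runs without a network connection, the five rows shown above can be rebuilt by hand. The variable names sample, first_two, and last_two are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows printed above, typed in by hand so the example runs offline
sample = pd.DataFrame({
    'week': [192801] * 5,
    'state': ['AL', 'AR', 'AZ', 'CA', 'CO'],
    'state_name': ['ALABAMA', 'ARKANSAS', 'ARIZONA', 'CALIFORNIA', 'COLORADO'],
    'disease': ['SMALLPOX'] * 5,
    'cases': [1, 7, 0, 18, 31],
    'incidence_per_capita': [0.04, 0.38, 0.00, 0.34, 3.06],
})

first_two = sample.head(2)   # first 2 rows instead of the default 5
last_two = sample.tail(2)    # last 2 rows
print(first_two)
print(last_two)
print(sample.shape)          # (number of rows, number of columns)
```

The argument to head() or tail() sets the number of rows returned; with no argument, both return 5 rows.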

The data.describe() command shows summary statistics.

data.describe()

This produces the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

                   week         cases  incidence_per_capita
   count   50916.000000  50916.000000          50916.000000
   mean   193809.850636      4.572787              0.249108
   std       591.489888     15.062277              0.824331
   min    192801.000000      0.000000              0.000000
   25%    193312.000000      0.000000              0.000000
   50%    193819.000000      0.000000              0.000000
   75%    194324.000000      2.000000              0.090000
   max    195250.000000    350.000000             50.360000
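
The same statistics can be computed directly with numpy, which helps show what describe() reports. The sketch below uses only the cases values from the five rows shown earlier, not the full data set, so the numbers differ from the table above. One detail to watch: pandas reports the sample standard deviation, so numpy needs ddof=1 to match.

```python
import numpy as np
import pandas as pd

# cases column from the five rows shown earlier (not the full data set)
cases = np.array([1, 7, 0, 18, 31])

stats = {
    'count': cases.size,
    'mean': cases.mean(),
    'std': cases.std(ddof=1),          # ddof=1 gives the sample standard deviation
    'min': cases.min(),
    '25%': np.percentile(cases, 25),
    '50%': np.percentile(cases, 50),   # the median
    '75%': np.percentile(cases, 75),
    'max': cases.max(),
}
for name, value in stats.items():
    print(f'{name:>5}: {float(value):.3f}')

# Cross-check against pandas
print(pd.Series(cases).describe())
```

Without ddof=1, numpy divides by n instead of n-1 and returns a slightly smaller standard deviation than pandas.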

Pandas Profiling

Pandas Profiling (ydata-profiling) is a data analysis tool for a more in-depth summary of the data than the describe() function. Install the package with

  pip install ipywidgets ydata-profiling

You need to restart the kernel before proceeding. The install only needs to run once.

from ydata_profiling import ProfileReport
profile = ProfileReport(data, explorative=True, minimal=False)

After you import ProfileReport, create a new profile to analyze the data. Some of the analyses take a long time with a large data set. Two methods for dealing with large data sets are to:

Sub-sample the data sets such as with data = data[::10] to take every 10th row.
Use the minimal=True option to avoid analysis that is slow with large data sets.
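
The sub-sampling step can be sketched with pandas alone. The DataFrame big below is a hypothetical stand-in filled with synthetic values, and the random sample is an alternative to slicing that is not part of the original instructions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a large data set: 1000 rows of synthetic values
rng = np.random.default_rng(0)
big = pd.DataFrame({'cases': rng.integers(0, 50, size=1000)})

every_10th = big[::10]                                # rows 0, 10, 20, ...
random_10pct = big.sample(frac=0.1, random_state=0)   # random 10% of the rows

print(len(big), len(every_10th), len(random_10pct))
```

Slicing keeps the original row order, while a random sample may be a safer choice if the rows follow a repeating pattern that a fixed step could line up with. Either reduced DataFrame can then be passed to ProfileReport in place of data.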

View the profile report in the Jupyter Notebook with profile.to_widgets() or export it to an HTML file with profile.to_file("analysis.html").

profile.to_file("analysis.html")

Activity

This activity uses summary statistics to analyze disease spread in the US states with smallpox data. The introductory exercise analyzes data specific to the state of Utah.

Math and Statistics Review Notebook

Jupyter Notebook in Google Colab

After the introductory exercise, create summary statistics to analyze a US state other than Utah. Use basic mathematical operations to create trends that describe the disease spread.
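
One possible starting point for the activity is sketched below. The rows are hypothetical and the case counts are made up for illustration; only the column names come from the smallpox table. The year calculation assumes the week column is coded as YYYYWW, which is inferred from values such as 192801 and should be checked against the data.

```python
import pandas as pd

# Hypothetical rows in the same layout as the smallpox table;
# the case counts are made up for illustration only
data = pd.DataFrame({
    'week': [192801, 192802, 192901, 192902, 192801],
    'state': ['UT', 'UT', 'UT', 'UT', 'CO'],
    'cases': [3, 5, 2, 4, 31],
})

utah = data[data['state'] == 'UT'].copy()      # keep only the Utah rows
utah['year'] = utah['week'] // 100             # assumes week is coded as YYYYWW
yearly = utah.groupby('year')['cases'].sum()   # total cases per year

print(utah['cases'].describe())
print(yearly)
```

Changing 'UT' to another state code repeats the analysis for a different state.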

Solutions

Math and Statistics Solution in Python

Google Colab

Math and Statistics Solution in MATLAB

✅ Knowledge Check

1. What is one of the primary uses of summary statistics in data engineering?

A. To rewrite and recreate data from scratch

Incorrect. Summary statistics are not used for rewriting or recreating data.

B. To assess data quality and diversity after gathering

Correct. Summary statistics provide valuable insights as one of the initial steps in data engineering, helping to assess data quality and diversity.

C. To modify the structure of the database

Incorrect. While data manipulation might involve modifying structures, summary statistics are about understanding the data's nature and distribution.

D. Only for graphical representation of data

Incorrect. While graphical representation might utilize summary statistics, they have broader applications.

2. Which Python library mentioned is primarily used for generating summary statistics and manipulating data tables?

A. numpy

Incorrect. While numpy is useful for numerical operations in Python, it's not the primary tool for generating summary statistics on data tables.

B. Anaconda

Incorrect. Anaconda is a distribution of Python and R for scientific computing and data science, not a specific library for summary statistics.

C. pandas

Correct. Pandas is widely used for data manipulation and generation of summary statistics, especially with its DataFrame functionality.

D. Jupyter Notebook

Incorrect. Jupyter Notebook is an interactive computing environment, not a specific library for generating summary statistics.
