Web Scraping
Internet data is a rich source of information. Much of this online information is designed for web browsers and intended to be viewed by humans. Web scraping is the retrieval and curation of online information by a computer program. Scraping automates tedious manual retrieval and can be used to watch a page for updates. The exercises in this section demonstrate how to retrieve data from a website, such as an image and a table.
📷 Download Image
Download python_web_scrape.png from this webpage using the urllib library.
# download image with urllib
import urllib.request

img = 'python_web_scrape.png'
url = 'http://apmonitor.com/dde/uploads/Main/'+img
urllib.request.urlretrieve(url, img)
Packages for image manipulation and computer vision include Pillow (Python Imaging Library), Scikit-image, Matplotlib (uses Pillow functions), and OpenCV. OpenCV is the most capable computer vision package and is supported in many development environments. Display the image with Matplotlib.
# display the downloaded image with Matplotlib
import matplotlib.pyplot as plt

im = plt.imread(img)
plt.imshow(im)
plt.show()
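For comparison, the lines below are a minimal sketch that reads and displays the same image with OpenCV. This assumes the opencv-python package is installed; it is not required for the rest of the exercise.
# minimal OpenCV sketch (assumes opencv-python is installed)
import cv2
im_cv = cv2.imread(img)                 # read the image as a BGR array
cv2.imshow('python_web_scrape', im_cv)  # open a display window
cv2.waitKey(0)                          # wait for a key press
cv2.destroyAllWindows()                 # close the window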
🔢 Read Table with Pandas
The Pandas Time-Series exercise demonstrates how to read a text file either from a local directory or from a URL. However, suppose the data is available online only as an HTML table.
Read the table into Python with the Pandas read_html() function. This function returns all tables on a webpage as a list of DataFrames. Use [0] to retrieve the first table.
import pandas as pd

url = 'http://apmonitor.com/dde/index.php/Main/WebScraping'
data = pd.read_html(url)[0]
data
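Because read_html() returns a list, it can help to check how many tables were found before selecting one. The lines below are a short, optional sketch of that check.
# count the tables found on the page before selecting one
tables = pd.read_html(url)
print(len(tables), 'table(s) found')
data = tables[0]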
The result is a DataFrame that can be modified, such as setting time as the index.
data.set_index('time', inplace=True)
data
time | Q1 | Q2 | T1 | T2 |
---|---|---|---|---|
0.0 | 0.0 | 0.0 | 20.9495 | 20.9495 |
5.0 | 0.0 | 0.0 | 20.9495 | 20.9495 |
10.0 | 70.0 | 0.0 | 20.9495 | 20.9495 |
15.0 | 70.0 | 0.0 | 21.5941 | 20.9495 |
20.0 | 70.0 | 0.0 | 22.2387 | 20.9495 |
25.0 | 70.0 | 0.0 | 22.8833 | 20.9495 |
30.0 | 70.0 | 0.0 | 23.8502 | 20.9495 |
35.0 | 70.0 | 0.0 | 25.1394 | 21.2718 |
40.0 | 70.0 | 0.0 | 26.1063 | 21.2718 |
45.0 | 70.0 | 0.0 | 27.0732 | 21.5941 |
50.0 | 70.0 | 0.0 | 28.3624 | 21.5941 |
55.0 | 70.0 | 0.0 | 29.3293 | 21.5941 |
60.0 | 70.0 | 0.0 | 30.6185 | 21.9164 |
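With time as the index, the temperature columns can be plotted directly against it. The lines below are an optional sketch; the column names T1 and T2 come from the table above, and the axis labels are illustrative.
# plot the temperature columns against the time index
import matplotlib.pyplot as plt
data[['T1','T2']].plot()
plt.xlabel('time')
plt.ylabel('temperature')
plt.show()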
🔢 Read Table with requests
Many websites block program (bot) access to avoid Distributed Denial of Service (DDoS) attacks that can overwhelm a web service. Before sending many requests, check with the website owner (or the site's terms of service) to avoid overloading the service. Read the table with requests and include a header that emulates a browser.
# look like a browser
import requests

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "+
                  "AppleWebKit/537.36 (KHTML, like Gecko) "+
                  "Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
r = requests.get(url, headers=header)
data = pd.read_html(r.text)[0]
data.set_index('time', inplace=True)
data
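When several pages are retrieved in a loop, pause between requests so the web service is not overloaded. The lines below are only a sketch; the list of URLs and the one-second delay are illustrative values, not part of the original exercise.
# pause between repeated requests to avoid overloading the server
import time
pages = [url]   # hypothetical list of URLs to fetch
for u in pages:
    r = requests.get(u, headers=header)
    time.sleep(1.0)   # wait one second between requests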
🥣 Beautiful Soup
Beautiful Soup is a Python package for extracting (scraping) information from web pages. It uses an HTML or XML parser and provides functions for iterating, searching, and modifying the parse tree. First, get the HTML source from a webpage such as this page.
url = 'http://apmonitor.com/dde/index.php/Main/WebScraping?action=print'
page = requests.get(url)
The attribute page.content contains the HTML source if page.status_code starts with a 2, such as 200 (downloaded successfully). A 4 or 5 indicates an error. BeautifulSoup parses HTML or XML files.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
Functions such as print(soup.prettify()) can be used to view the structured output or print the page title.
print(soup.title.string)
Web Scraping
All of the links are extracted:
# loop through all anchor (a) tags and print the text and href
for link in soup.find_all('a'):
    print('Link Text: {}'.format(link.text))
    print('href: {}'.format(link.get('href')))
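As another small illustration of searching the parse tree, the sketch below lists the heading tags on the page. The tag names h2 and h3 are standard HTML elements chosen for illustration.
# find second- and third-level headings in the parsed page
for h in soup.find_all(['h2', 'h3']):
    print(h.get_text(strip=True))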
Pandas uses BeautifulSoup (or another HTML parser) to extract tables from webpages. Web scraping is particularly useful for getting information from webpages that are updated with new information such as weather, stock data, and customer reviews. More advanced web scraping packages include MechanicalSoup, Scrapy, and Selenium.
✅ Activity
Practice web scraping to retrieve data from another website of interest that contains a table. Organize the content into a DataFrame and export the DataFrame to a CSV file. Below is an example of retrieval from the Wikipedia article on Data Tables where the data table is saved as test.csv. Change the url, use requests with a browser header if necessary, and export the data file.
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Table_(information)'
req = requests.get(url)
data = pd.read_html(req.content)[0]
data.to_csv('test.csv')
data
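If the chosen website rejects plain program requests, the browser-style header from the earlier section can be reused. The lines below sketch that variation with the same Wikipedia URL; header refers to the dictionary defined above.
# reuse the browser-style header for sites that block plain requests
r = requests.get(url, headers=header)
data = pd.read_html(r.text)[0]
data.to_csv('test.csv')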
# save every table from the FIFA World Cup article as a CSV file
import os
import requests
import pandas as pd

# create a directory for the exported tables
try:
    os.mkdir('./tables')
except FileExistsError:
    print('Directory already exists')

url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'
req = requests.get(url)
tables = pd.read_html(req.content)
for i,t in enumerate(tables):
    print(f'----- Table {i} ----------------------------')
    print(t.head(3))
    t.to_csv('./tables/table_'+str(i)+'.csv')
✅ Knowledge Check
1. Which package is primarily used to extract (scrape) information from web pages by iterating, searching, and modifying the parse tree?
- Incorrect. While Pandas can be used to extract tables from web pages, it's not primarily used for web scraping and modifying the parse tree.
- Incorrect. Matplotlib is primarily used for data visualization and not for web scraping.
- Correct. BeautifulSoup is a Python package primarily used for extracting information from web pages and offers functions to iterate, search, and modify the parse tree.
- Incorrect. OpenCV is primarily used for computer vision and not for web scraping.
2. Which of the following statements regarding web scraping is accurate?
- Incorrect. The read_html() function in Pandas can accept both URLs and HTML content directly.
- Incorrect. Many websites block program (bot) access to avoid overloads or misuse. Always check with the website's terms and conditions before scraping.
- Correct. BeautifulSoup uses an HTML or XML parser to function and provides various functions to iterate, search, and modify the parse tree.
- Incorrect. Matplotlib is primarily used for data visualization and not for web scraping.