Web Scraping

Internet data is a rich source of information. Much of this online information is designed to be viewed by humans in a web browser. Web scraping is the retrieval and curation of online information by a computer program. Scraping automates the tedious manual retrieval of information and can be used to watch for updates. The exercises in this section demonstrate how to retrieve data, such as an image and a table, from a website.

📷 Download Image

Download python_web_scrape.png from this webpage using the urllib library.

import urllib.request

# download image
img = 'python_web_scrape.png'
url = 'http://apmonitor.com/dde/uploads/Main/'+img
urllib.request.urlretrieve(url, img)
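
As an alternative sketch (an assumption, using the requests package introduced later in this section), the same file can be downloaded by writing the response bytes to disk:

import requests

# download the image by saving the response bytes
resp = requests.get(url)
with open(img, 'wb') as f:
    f.write(resp.content)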

Packages for image manipulation and computer vision include Pillow (Python Imaging Library), scikit-image, Matplotlib (which uses Pillow functions), and OpenCV. OpenCV is the most capable computer vision package and is supported in many development environments. Display the image with Matplotlib.

import matplotlib.pyplot as plt
# read and display the downloaded image
im = plt.imread(img)
plt.imshow(im)
plt.show()
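
As a sketch (assuming Pillow and OpenCV are installed), the same file can also be opened with the other packages mentioned above:

from PIL import Image
import cv2

# open with Pillow and report the image size
pil_img = Image.open(img)
print(pil_img.size)

# read with OpenCV (returns a NumPy array in BGR channel order)
cv_img = cv2.imread(img)
print(cv_img.shape)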

🔢 Read Table with Pandas

The Pandas Time-Series exercise demonstrates how to read a text file either from a local directory or from a URL. However, suppose the data is online as an HTML table.

Read the table into Python with the Pandas read_html() function. This function returns all tables on a webpage as a list of DataFrames. Use [0] to retrieve the first table.

import pandas as pd
url = 'http://apmonitor.com/dde/index.php/Main/WebScraping'
data = pd.read_html(url)[0]
data

The table is returned as a DataFrame and can be modified, such as setting time as the index.

data.set_index('time',inplace=True)
data
time    Q1    Q2    T1       T2
 0.0   0.0   0.0   20.9495  20.9495
 5.0   0.0   0.0   20.9495  20.9495
10.0  70.0   0.0   20.9495  20.9495
15.0  70.0   0.0   21.5941  20.9495
20.0  70.0   0.0   22.2387  20.9495
25.0  70.0   0.0   22.8833  20.9495
30.0  70.0   0.0   23.8502  20.9495
35.0  70.0   0.0   25.1394  21.2718
40.0  70.0   0.0   26.1063  21.2718
45.0  70.0   0.0   27.0732  21.5941
50.0  70.0   0.0   28.3624  21.5941
55.0  70.0   0.0   29.3293  21.5941
60.0  70.0   0.0   30.6185  21.9164
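
Other DataFrame operations apply as usual; a brief sketch (not part of the original output) summarizes the data and plots the temperature columns:

# summary statistics and a quick plot of the temperature columns
print(data.describe())
data[['T1','T2']].plot()
plt.show()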

🔢 Read Table with requests

Many websites block program (bot) access to avoid Distributed Denial of Service (DDoS) attacks that can overwhelm a web service. Before sending many requests, check with the website owner or the site's terms of service so the service is not overloaded. Read the table with requests, including a header that emulates a browser.

import requests
# look like a browser
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "+
                "AppleWebKit/537.36 (KHTML, like Gecko) "+
                "Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}
r = requests.get(url, headers=header)
data = pd.read_html(r.text)[0]
data.set_index('time',inplace=True)
data

🥣 Beautiful Soup

Beautiful Soup is a Python package for extracting (scraping) information from web pages. It uses an HTML or XML parser and provides functions for iterating, searching, and modifying the parse tree. First, get the HTML source of a webpage such as this page.

import requests
url = 'http://apmonitor.com/dde/index.php/Main/WebScraping?action=print'
page = requests.get(url)

The attribute page.content contains the HTML source if page.status_code starts with a 2, such as 200 (downloaded successfully). A 4 or 5 indicates an error. BeautifulSoup parses HTML or XML files.

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
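
Before parsing in a script, the status can be verified (a small sketch, reusing the page object from above):

# proceed only if the request succeeded (2xx status code)
if 200 <= page.status_code < 300:
    print('Download successful')
else:
    print('Request failed with status', page.status_code)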

Functions such as print(soup.prettify()) can be used to view the structured output or to print the page title:

print(soup.title.text)
  Web Scraping

All of the links are extracted:

for link in soup.find_all('a'):
    print('Link Text: {}'.format(link.text))
    print('href: {}'.format(link.get('href')))
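
BeautifulSoup can also locate specific elements; as a small sketch, count the <table> elements that Pandas parses with read_html():

# count the <table> elements in the parsed page
tables = soup.find_all('table')
print(len(tables))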

Pandas uses BeautifulSoup to extract tables from webpages. Web scraping is particularly useful for getting information from webpages that are updated with new information such as weather, stock data, and customer reviews. More advanced web scraping packages include MechanicalSoup, Scrapy, and Selenium.
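
For pages that render content with JavaScript, a browser-automation tool such as Selenium can retrieve the rendered HTML; a minimal sketch (assuming a Chrome driver is installed):

from selenium import webdriver

# launch a browser, load the page, and capture the rendered HTML
driver = webdriver.Chrome()
driver.get('http://apmonitor.com/dde/index.php/Main/WebScraping')
html = driver.page_source
driver.quit()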

✅ Activity

Practice web scraping by retrieving data from another website of interest that contains a table. Organize the content into a DataFrame and export the DataFrame to a CSV file. Below is an example of retrieval from the Wikipedia article Table (information), where the data table is saved as test.csv. Change the url, use requests with a browser header if necessary, and export the data file.

import pandas as pd
import requests
url = 'https://en.wikipedia.org/wiki/Table_(information)'
req = requests.get(url)
data = pd.read_html(req.content)[0]
data.to_csv('test.csv')
data
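
To verify the export (a small sketch), read the CSV file back into a DataFrame:

# read the exported file back to confirm the contents
check = pd.read_csv('test.csv', index_col=0)
print(check.head())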

A further example retrieves every table from the Wikipedia FIFA World Cup article and writes each one to a separate CSV file in a tables directory.

import pandas as pd
import requests
import os

# create the output directory if it does not exist
os.makedirs('./tables', exist_ok=True)

url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'
req = requests.get(url)
tables = pd.read_html(req.content)
for i,t in enumerate(tables):
    print(f'----- Table {i} ----------------------------')
    print(t.head(3))
    t.to_csv('./tables/table_'+str(i)+'.csv')

✅ Knowledge Check

1. Which package is primarily used to extract (scrape) information from web pages by iterating, searching, and modifying the parse tree?

A. Pandas
Incorrect. While Pandas can be used to extract tables from web pages, it's not primarily used for web scraping and modifying the parse tree.
B. Matplotlib
Incorrect. Matplotlib is primarily used for data visualization and not for web scraping.
C. BeautifulSoup
Correct. BeautifulSoup is a Python package primarily used for extracting information from web pages and offers functions to iterate, search, and modify the parse tree.
D. OpenCV
Incorrect. OpenCV is primarily used for computer vision and not for web scraping.

2. Which of the following statements regarding web scraping is accurate?

A. The read_html() function in Pandas requires the URL to be passed directly for web scraping.
Incorrect. The read_html() function in Pandas can accept both URLs and HTML content directly.
B. All websites allow bot access for web scraping.
Incorrect. Many websites block program (bot) access to avoid overloads or misuse. Always check with the website's terms and conditions before scraping.
C. BeautifulSoup requires an HTML or XML parser to function.
Correct. BeautifulSoup uses an HTML or XML parser to function and provides various functions to iterate, search, and modify the parse tree.
D. Matplotlib is a web scraping package that can be used to retrieve tables from web pages.
Incorrect. Matplotlib is primarily used for data visualization and not for web scraping.