Web Scraping
Internet data is a rich source of information. Much of this online information is designed for web browsers and intended to be viewed by humans. Web scraping is the retrieval and curation of online information by a computer program. Scraping automates tedious manual retrieval and can be used to watch a page for updates. The exercises in this section demonstrate how to retrieve data from a website, such as an image and a table.
📷 Download Image
Download python_web_scrape.png from this webpage using the urllib library.
# download image with urllib
import urllib.request

img = 'python_web_scrape.png'
url = 'http://apmonitor.com/dde/uploads/Main/'+img
urllib.request.urlretrieve(url, img)
Packages for image manipulation and computer vision include Pillow (Python Imaging Library), Scikit-image, Matplotlib (uses Pillow functions), and OpenCV. OpenCV is the most capable computer vision package and is supported in many development environments. Display the image with Matplotlib.
# display the downloaded image with Matplotlib
import matplotlib.pyplot as plt

im = plt.imread(img)
plt.imshow(im)
plt.show()
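For comparison, the lines below are a minimal sketch that reads and displays the same image with OpenCV. This assumes the opencv-python package is installed; it is not required for the rest of the exercise.
# minimal OpenCV sketch (assumes opencv-python is installed)
import cv2
im_cv = cv2.imread(img)                 # read the image as a BGR array
cv2.imshow('python_web_scrape', im_cv)  # open a display window
cv2.waitKey(0)                          # wait for a key press
cv2.destroyAllWindows()                 # close the window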
🔢 Read Table with Pandas
The Pandas Time-Series exercise demonstrates how to read a text file either from a local directory or from a URL. However, suppose the data is available online only as an HTML table.
Read the table into Python with the Pandas read_html() function. This function returns all tables on a webpage as a list of DataFrames. Use [0] to retrieve the first table.
import pandas as pd

url = 'http://apmonitor.com/dde/index.php/Main/WebScraping'
data = pd.read_html(url)[0]
data
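Because read_html() returns a list, it can help to check how many tables were found before selecting one. The lines below are a short, optional sketch of that check.
# count the tables found on the page before selecting one
tables = pd.read_html(url)
print(len(tables), 'table(s) found')
data = tables[0]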
The result is a DataFrame that can be modified, such as setting time as the index.
data.set_index('time', inplace=True)
data
time | Q1 | Q2 | T1 | T2 |
---|---|---|---|---|
0.0 | 0.0 | 0.0 | 20.9495 | 20.9495 |
5.0 | 0.0 | 0.0 | 20.9495 | 20.9495 |
10.0 | 70.0 | 0.0 | 20.9495 | 20.9495 |
15.0 | 70.0 | 0.0 | 21.5941 | 20.9495 |
20.0 | 70.0 | 0.0 | 22.2387 | 20.9495 |
25.0 | 70.0 | 0.0 | 22.8833 | 20.9495 |
30.0 | 70.0 | 0.0 | 23.8502 | 20.9495 |
35.0 | 70.0 | 0.0 | 25.1394 | 21.2718 |
40.0 | 70.0 | 0.0 | 26.1063 | 21.2718 |
45.0 | 70.0 | 0.0 | 27.0732 | 21.5941 |
50.0 | 70.0 | 0.0 | 28.3624 | 21.5941 |
55.0 | 70.0 | 0.0 | 29.3293 | 21.5941 |
60.0 | 70.0 | 0.0 | 30.6185 | 21.9164 |
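With time as the index, the temperature columns can be plotted directly against it. The lines below are an optional sketch; the column names T1 and T2 come from the table above, and the axis labels are illustrative.
# plot the temperature columns against the time index
import matplotlib.pyplot as plt
data[['T1','T2']].plot()
plt.xlabel('time')
plt.ylabel('temperature')
plt.show()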
🔢 Read Table with requests
Many websites block program (bot) access to avoid Distributed Denial of Service (DDoS) attacks that can overwhelm a web service. Before sending many requests, check with the website owner (or the site's terms of service) to avoid overloading the service. Read the table with requests and include a header that emulates a browser.
# look like a browser
import requests

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "+
                  "AppleWebKit/537.36 (KHTML, like Gecko) "+
                  "Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}
r = requests.get(url, headers=header)
data = pd.read_html(r.text)[0]
data.set_index('time', inplace=True)
data
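When several pages are retrieved in a loop, pause between requests so the web service is not overloaded. The lines below are only a sketch; the list of URLs and the one-second delay are illustrative values, not part of the original exercise.
# pause between repeated requests to avoid overloading the server
import time
pages = [url]   # hypothetical list of URLs to fetch
for u in pages:
    r = requests.get(u, headers=header)
    time.sleep(1.0)   # wait one second between requests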
🥣 Beautiful Soup
Beautiful Soup is a Python package for extracting (scraping) information from web pages. It uses an HTML or XML parser and provides functions for iterating, searching, and modifying the parse tree. First, get the HTML source from a webpage such as this page.
url = 'http://apmonitor.com/dde/index.php/Main/WebScraping?action=print'
page = requests.get(url)
The attribute page.content contains the HTML source if page.status_code starts with a 2, such as 200 (downloaded successfully). A 4 or 5 indicates an error. BeautifulSoup parses HTML or XML files.
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')
Functions such as print(soup.prettify()) can be used to view the structured output or print the page title.
print(soup.title.string)
Web Scraping
All of the links are extracted:
# loop through all anchor (a) tags and print the text and href
for link in soup.find_all('a'):
    print('Link Text: {}'.format(link.text))
    print('href: {}'.format(link.get('href')))
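As another small illustration of searching the parse tree, the sketch below lists the heading tags on the page. The tag names h2 and h3 are standard HTML elements chosen for illustration.
# find second- and third-level headings in the parsed page
for h in soup.find_all(['h2', 'h3']):
    print(h.get_text(strip=True))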
Pandas uses BeautifulSoup (or another HTML parser) to extract tables from webpages. Web scraping is particularly useful for getting information from webpages that are updated with new information such as weather, stock data, and customer reviews. More advanced web scraping packages include MechanicalSoup, Scrapy, and Selenium.
✅ Activity
Practice web scraping to retrieve data from another website of interest that contains a table. Organize the content into a DataFrame and export the DataFrame to a CSV file. Below is an example of retrieval from the Wikipedia article on Data Tables where the data table is saved as test.csv. Change the url, use requests with a browser header if necessary, and export the data file.
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Table_(information)'
req = requests.get(url)
data = pd.read_html(req.content)[0]
data.to_csv('test.csv')
data
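If the chosen website rejects plain program requests, the browser-style header from the earlier section can be reused. The lines below sketch that variation with the same Wikipedia URL; header refers to the dictionary defined above.
# reuse the browser-style header for sites that block plain requests
r = requests.get(url, headers=header)
data = pd.read_html(r.text)[0]
data.to_csv('test.csv')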
# save every table from the FIFA World Cup article as a CSV file
import os
import requests
import pandas as pd

# create a directory for the exported tables
try:
    os.mkdir('./tables')
except FileExistsError:
    print('Directory already exists')

url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'
req = requests.get(url)
tables = pd.read_html(req.content)
for i,t in enumerate(tables):
    print(f'----- Table {i} ----------------------------')
    print(t.head(3))
    t.to_csv('./tables/table_'+str(i)+'.csv')
✅ Knowledge Check
1. Which package is primarily used to extract (scrape) information from web pages by iterating, searching, and modifying the parse tree?
- Incorrect. While Pandas can be used to extract tables from web pages, it's not primarily used for web scraping and modifying the parse tree.
- Incorrect. Matplotlib is primarily used for data visualization and not for web scraping.
- Correct. BeautifulSoup is a Python package primarily used for extracting information from web pages and offers functions to iterate, search, and modify the parse tree.
- Incorrect. OpenCV is primarily used for computer vision and not for web scraping.
2. Which of the following statements regarding web scraping is accurate?
- Incorrect. The read_html() function in Pandas can accept both URLs and HTML content directly.
- Incorrect. Many websites block program (bot) access to avoid overloads or misuse. Always check with the website's terms and conditions before scraping.
- Correct. BeautifulSoup uses an HTML or XML parser to function and provides various functions to iterate, search, and modify the parse tree.
- Incorrect. Matplotlib is primarily used for data visualization and not for web scraping.