Cloud Access

Cloud access is an important part of data engineering because it enables data aggregation, storage, retrieval, and enterprise scale-up. The three largest cloud service providers are Amazon, Microsoft, and Google. Their services cover hosting of servers and containers, data storage as files or databases, and network and access functions. Cloud-based storage and computing scale up as an application needs more compute resources or space.

| Service | Amazon Web Services (AWS) | Microsoft Azure | Google Cloud Platform (GCP) |
|---|---|---|---|
| Servers and Containers | | | |
| Virtual Servers | Elastic Compute Cloud (EC2) | Virtual Machines | Google Compute Engine |
| Serverless Computing | Lambda | Azure Functions | Cloud Functions |
| Kubernetes Management | Elastic Kubernetes Service | Azure Kubernetes Service | Google Kubernetes Engine |
| Data Storage | | | |
| Object Storage | Simple Storage Service (S3) | Azure Blob | Cloud Storage |
| File Storage | Elastic File System (EFS) | Azure Files | Filestore |
| Block Storage | Elastic Block Store (EBS) | Azure Disk | Persistent Disk |
| Relational Database | Relational Database Service (RDS) | SQL Database | Cloud SQL |
| NoSQL Database | DynamoDB | Cosmos DB | Firestore |
| Network | | | |
| Virtual Network | Virtual Private Cloud | Azure VNet | Virtual Private Cloud |
| Content Delivery Network | CloudFront | Azure CDN | Cloud CDN |
| DNS Service | Route 53 | Azure DNS | Cloud DNS |
| Security and Authorization | | | |
| Authentication and Authorization | IAM | Azure Active Directory (now Microsoft Entra ID) | Cloud IAM |
| Key Management | KMS | Azure Key Vault | KMS |
| Network Security | AWS WAF | Application Gateway | Cloud Armor |

Tutorials on data access for each platform are best obtained from the cloud hosting provider. Each provider publishes tutorials on accessing data with Python, including several excellent tutorials for beginners. Students are typically given free or reduced-cost access to the platforms for learning. Each platform requires registration and an account to use the services.

Cloud services start with the choice of public, private, or hybrid cloud access. The base layer, Infrastructure as a Service (IaaS), includes bare metal servers, virtual machines, disk space, networking, and load balancers. The next layer is Platform as a Service (PaaS), with platforms that use IaaS to run applications, databases, servers, and data lakes that span multiple storage units. These compute, storage, and networking services are built upon further to create cloud applications, or Software as a Service (SaaS), as a complete solution. Clients (computer browsers, mobile apps, IoT devices) connect to the SaaS applications to send data, retrieve results, and use the functions they provide. Applications are designed to scale up resources as needed and to distribute computing and data storage so that the SaaS remains resilient to unplanned outages.

✅ Activity

Create a Python program to monitor a directory for a photo and remove the background with the rembg Python package. Deploy the application as a service to process photos.

Service to remove image background

Installing the rembg package downgrades a number of other packages, such as numpy, so it is recommended to set up a virtual environment (venv) for the installation as shown with Install Packages.

python3 -m venv bckgrm
source bckgrm/bin/activate
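
It can be useful to confirm that Python is running inside the virtual environment before installing packages that downgrade others. The following is a minimal sketch using only the standard library; the function name in_virtual_environment is a hypothetical helper and not part of rembg or venv.

```python
import sys

def in_virtual_environment():
    """Return True when the interpreter runs inside a venv."""
    # a venv points sys.prefix at the environment folder, while
    # sys.base_prefix still points at the base Python installation
    return sys.prefix != sys.base_prefix

print("virtual environment active:", in_virtual_environment())
print("interpreter prefix:", sys.prefix)
```

If the first line prints False, the activate command was likely not run in the current terminal session.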

With the virtual environment activated, install rembg and websockets (used later for the web API). The optional [gpu] extra can be installed with pip install rembg[gpu] when running on a system with a GPU, such as Google Colab. The first time rembg runs, it downloads a 176 MB machine learning model.

pip install rembg websockets

Test the rembg application with the remove() function. Use other photos to test the performance. Replace image_input.jpg with the path to the input image.

import urllib.request
from rembg import remove
from PIL import Image

input_path = 'image_input.jpg'
output_path = 'image_output.png'

# download the sample image
url = 'http://apmonitor.com/dde/uploads/Main/'+input_path
urllib.request.urlretrieve(url, input_path)

# remove the background and save as PNG to keep transparency
# (img and out avoid shadowing the built-in input function)
img = Image.open(input_path)
out = remove(img)
out.save(output_path)
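
To test performance with other photos, the call can be wrapped in a small timing helper. This is a sketch under stated assumptions: time_call is a hypothetical helper, and the built-in sorted function stands in for remove so that the example runs without rembg installed.

```python
import time

def time_call(func, *args, repeats=3):
    """Call func(*args) repeatedly; return (last result, list of seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        durations.append(time.perf_counter() - start)
    return result, durations

# stand-in workload (assumption); in the activity this could be
# time_call(remove, Image.open(input_path))
result, durations = time_call(sorted, [3, 1, 2])
print(result, [f"{d:.6f} s" for d in durations])
```

When timing rembg, the first call may include the one-time model download, so the later repeats are likely a better indication of processing time per photo.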

Process Local Folder Images to Remove Background

This solution demonstrates how to monitor a local computer folder and automatically remove the background from any images placed into the input folder. This input folder could be a Dropbox or Google Drive folder where image files are placed for automatic background removal.

Create a virtual environment and install rembg.

python3 -m venv backgrm
source backgrm/bin/activate
python3 -m pip install rembg

Create a new file to run the service. The program runs for about 200 seconds (plus any image processing time) and checks the fpath_in folder once per second for new files. If an image file is found, it removes the background from the image and saves the new image to the fpath_out folder. Finally, it deletes the input image. This simple program can be used to receive online image submissions and display the output through a webpage.

import os
import time
import glob
from rembg import remove
from PIL import Image

# create directories to store images (no error if they already exist)
fpath_in = './input'
fpath_out = './output'
for f in [fpath_in,fpath_out]:
    os.makedirs(f, exist_ok=True)

i=0
while i<=200:
    # scan input directory every second
    time.sleep(1.0)
    i+=1
    fp = glob.glob(fpath_in+'/*')

    for f in fp:
        print(f)
        # open image
        img = Image.open(f)
        # remove background
        out = remove(img)
        # get file name
        img_name = os.path.basename(f)
        # save to output folder (photo.jpg becomes photo.jpg.png)
        out.save(os.path.join(fpath_out, img_name+'.png'))
        # close the file handle, then remove input folder image
        img.close()
        os.remove(f)
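
The loop above opens every file found in the input folder, including files that are not images or that are still being copied by a sync client such as Dropbox. One possible safeguard is to select only image files that have not changed for a few seconds. The sketch below assumes a hypothetical helper ready_images, an illustrative extension list, and a 2 second quiet period; none of these come from rembg.

```python
import os
import time

# illustrative list of extensions to accept (assumption)
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def ready_images(folder, min_age=2.0, now=None):
    """List image files in folder unchanged for at least min_age seconds."""
    now = time.time() if now is None else now
    with os.scandir(folder) as entries:
        candidates = sorted(entries, key=lambda e: e.name)
    ready = []
    for entry in candidates:
        ext = os.path.splitext(entry.name)[1].lower()
        # skip folders and files that do not look like images
        if not entry.is_file() or ext not in IMAGE_EXTENSIONS:
            continue
        # skip files modified too recently (possibly still being written)
        if now - entry.stat().st_mtime >= min_age:
            ready.append(entry.path)
    return ready

# usage in the monitoring loop, replacing glob.glob(fpath_in+'/*'):
#   fp = ready_images(fpath_in)
```

The modification-time test is only a heuristic. A slow upload that pauses for longer than min_age could still be picked up early.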

Web API with Websockets to Remove Background

Create a virtual environment and install rembg and websockets.

python3 -m venv backgrm
source backgrm/bin/activate
python3 -m pip install rembg websockets

Save the following as server.py and run it in the Python virtual environment.

import asyncio
import websockets
import io
from PIL import Image
from rembg import remove

# recent websockets releases pass only the connection to the handler
async def handle_image(websocket):
    # receive the image as a binary message
    image_bytes = await websocket.recv()
    image = Image.open(io.BytesIO(image_bytes))

    # remove image background
    image = remove(image)

    # return the result as PNG bytes
    new_image_bytes = io.BytesIO()
    image.save(new_image_bytes, format='PNG')
    await websocket.send(new_image_bytes.getvalue())

async def main():
    # max_size raises the default 1 MiB message limit to 16 MiB for photos
    async with websockets.serve(handle_image, "localhost", 8765, max_size=16 * 2**20):
        await asyncio.Future()  # run until interrupted

asyncio.run(main())
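
The handler above passes whatever it receives to Image.open. A service that accepts uploads from other computers may benefit from rejecting payloads that are not images before processing them. The sketch below checks the leading bytes against well-known PNG, JPEG, and GIF file signatures; sniff_image_type is a hypothetical helper and not part of websockets, Pillow, or rembg.

```python
def sniff_image_type(data):
    """Return 'png', 'jpeg', or 'gif' from the leading bytes, else None."""
    # text frames arrive as str, so only binary payloads are inspected
    if not isinstance(data, (bytes, bytearray)):
        return None
    signatures = {
        b"\x89PNG\r\n\x1a\n": "png",
        b"\xff\xd8\xff": "jpeg",
        b"GIF87a": "gif",
        b"GIF89a": "gif",
    }
    for magic, name in signatures.items():
        if bytes(data).startswith(magic):
            return name
    return None

# possible use inside the handler before calling Image.open:
#   if sniff_image_type(image_bytes) is None:
#       await websocket.close(code=1003, reason="unsupported data")
#       return
print(sniff_image_type(b"\xff\xd8\xff\xe0 rest of file"))
```

A signature check only filters obvious mistakes. Pillow can still raise an error on a truncated or corrupt file, so the handler may also need a try/except around Image.open.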

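Save the following as client.py and run it on the local computer. To run the client on another networked computer, the server must listen on a network interface (for example "0.0.0.0" in place of "localhost") and the client must use the server address in place of localhost in the connection URL.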

import asyncio
import websockets
import urllib.request

# download image or replace with another image file
input_path = 'image_input.jpg'
url = 'http://apmonitor.com/dde/uploads/Main/'+input_path
urllib.request.urlretrieve(url, input_path)

async def send_image():
    # max_size raises the default 1 MiB limit on received messages to 16 MiB
    async with websockets.connect("ws://localhost:8765", max_size=16 * 2**20) as websocket:
        # send the image file as a binary message
        with open(input_path, "rb") as image_file:
            await websocket.send(image_file.read())
        # receive the PNG with the background removed
        new_image_bytes = await websocket.recv()
        with open("new_image.png", "wb") as new_image_file:
            new_image_file.write(new_image_bytes)

asyncio.run(send_image())

See WebSocket Transfer for more information about data transfer.

Docker Container to Remove Background

Create a folder to store files Dockerfile, server.py, and requirements.txt. First, create Dockerfile as a text file with no file extension.

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./

RUN pip install --no-cache-dir -r requirements.txt

# Open port 8765
EXPOSE 8765

COPY server.py ./

CMD [ "python", "./server.py" ]

Create requirements.txt and save in the docker build folder.

rembg
websockets

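Create server.py and save it in the docker build folder. The server listens on 0.0.0.0 rather than localhost so that it is reachable through the published container port.

import asyncio
import websockets
import io
from PIL import Image
from rembg import remove

# recent websockets releases pass only the connection to the handler
async def handle_image(websocket):
    # receive the image as a binary message
    image_bytes = await websocket.recv()
    image = Image.open(io.BytesIO(image_bytes))

    # remove image background
    image = remove(image)

    # return the result as PNG bytes
    new_image_bytes = io.BytesIO()
    image.save(new_image_bytes, format='PNG')
    await websocket.send(new_image_bytes.getvalue())

async def main():
    # listen on all interfaces inside the container
    # max_size raises the default 1 MiB message limit to 16 MiB for photos
    async with websockets.serve(handle_image, "0.0.0.0", 8765, max_size=16 * 2**20):
        await asyncio.Future()  # run until interrupted

asyncio.run(main())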

Build the docker image.

docker build -t rembg-service .

Run a container from the docker image and publish port 8765 to the host.

docker run -p 8765:8765 -it --rm --name bkrm rembg-service
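
Before starting the client, it may help to confirm that the published port accepts connections. The following sketch uses only the standard library; wait_for_port is a hypothetical helper, and the 30 second timeout is an arbitrary choice.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # open and immediately close a TCP connection
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # nothing is listening yet; wait and retry
            time.sleep(interval)
    return False

# usage (assumes the container publishes port 8765 on this computer):
#   if wait_for_port("localhost", 8765):
#       print("port 8765 is reachable")
```

A successful connection only shows that something accepts TCP connections on the port. It does not prove that the websocket server inside the container has finished starting.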

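After the container is running, create client.py and run it on the local computer or on another networked computer. On another computer, replace localhost in the connection URL with the address of the computer that runs the container.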

import asyncio
import websockets
import urllib.request

# download image or replace with another image file
input_path = 'image_input.jpg'
url = 'http://apmonitor.com/dde/uploads/Main/'+input_path
urllib.request.urlretrieve(url, input_path)

async def send_image():
    # max_size raises the default 1 MiB limit on received messages to 16 MiB
    async with websockets.connect("ws://localhost:8765", max_size=16 * 2**20) as websocket:
        # send the image file as a binary message
        with open(input_path, "rb") as image_file:
            await websocket.send(image_file.read())
        # receive the PNG with the background removed
        new_image_bytes = await websocket.recv()
        with open("new_image.png", "wb") as new_image_file:
            new_image_file.write(new_image_bytes)

asyncio.run(send_image())

See WebSocket Transfer for more information about data transfer.


✅ Knowledge Check

1. Which of the following is NOT one of the cloud service providers listed in the service comparison?

A. Google
Incorrect. Google is one of the listed providers. Google Cloud Platform had an 11% market share in 2023.
B. IBM
Correct. IBM is not listed as one of the three largest cloud service providers. In 2023, it had 3% of the cloud computing market. Other notable cloud service providers are Alibaba (4%), Salesforce (3%), Oracle (2%), and Tencent (2%).
C. Microsoft
Incorrect. Microsoft is one of the listed providers. Microsoft Azure had a 22% market share in 2023.
D. Amazon
Incorrect. Amazon is one of the listed providers. Amazon Web Services (AWS) had a 32% market share in 2023.

2. What does the rembg package primarily help with?

A. Deploying applications on cloud
Incorrect. The rembg package is not for deploying applications but for removing image backgrounds.
B. Monitoring directories for photos
Incorrect. The rembg package does not monitor directories but removes image backgrounds.
C. Removing background from images
Correct. The primary function of the rembg package is to remove the background from images. It is deployed as a cloud service in the activity.
D. Setting up virtual environments
Incorrect. While the content suggests setting up a virtual environment before installing rembg, the package itself does not set up virtual environments.