Computer Vision with Deep Learning

In computer vision, deep learning has proven useful for extracting patterns from images. Deep learning uses a neural network and optimization to relate features (pixels) to a desired label. Unlike Cascade Classifiers, deep learning does not need specialized preprocessing of the image to develop application-specific features. Deep learning can also generate images from text, as AI art generators do.

The pixels from the image are processed through multiple linear and nonlinear layers to predict an output. Deep learning generally requires many thousands of labeled examples to learn.
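
As a minimal sketch of this idea, the NumPy code below passes the pixels of a small image through one linear layer, a ReLU nonlinearity, and a linear output layer followed by softmax. The 8x8 image, the layer sizes, and the random weights are assumptions chosen only for illustration; in practice the weights are learned from labeled examples.

```python
import numpy as np

def forward(pixels, W1, b1, W2, b2):
    """Two-layer forward pass: linear -> ReLU -> linear -> softmax."""
    x = pixels.reshape(-1) / 255.0      # flatten and scale pixels to [0, 1]
    h = np.maximum(0.0, W1 @ x + b1)    # linear layer followed by ReLU
    z = W2 @ h + b2                     # linear output layer (class scores)
    e = np.exp(z - z.max())             # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8))   # hypothetical 8x8 grayscale image
W1, b1 = rng.normal(scale=0.1, size=(16, 64)), np.zeros(16)  # untrained weights
W2, b2 = rng.normal(scale=0.1, size=(3, 16)), np.zeros(3)    # 3 hypothetical classes

probs = forward(image, W1, b1, W2, b2)
print(probs)   # one probability per class
```

With untrained weights the probabilities carry no meaning. Training adjusts W1, b1, W2, and b2 so that the largest probability matches the label.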

A Convolutional Neural Network (CNN) transforms the input image with a specialized connectivity structure. It stacks multiple stages of feature extractors. The higher stages compute more global, invariant features, with a classification layer at the end. Each feed-forward feature extraction stage convolves its input with learned filters, applies a non-linearity (sigmoid, hyperbolic tangent, or rectified linear unit), performs spatial pooling, and finally normalizes to create a feature map. With convolution the dependencies are local, the same filter responds wherever a pattern appears (translation invariance after pooling), and there are few parameters because the filter weights are shared across the image (the stride is a fixed setting rather than a learned weight). The convolutional filters are trained with supervision by back-propagating the classification error.
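
The convolution, non-linearity, and pooling steps of one stage can be sketched in NumPy as follows. This is an illustration under assumptions, not a production implementation: the 6x6 image and the vertical-edge filter are hand-picked here, whereas a real CNN learns its filters, and the normalization step is omitted.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the filter over the image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative responses become zero."""
    return np.maximum(0.0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                           # dark left half, bright right half
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)    # hand-picked vertical-edge filter

feature_map = max_pool(relu(conv2d(image, kernel)))
print(feature_map)
```

The filter has only 9 weights regardless of the image size, which is why convolutional layers need far fewer parameters than a fully connected layer on the same pixels.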

Deep learning has made significant progress in face recognition, image classification, speech recognition, text-to-speech generation, handwriting transcription, medical diagnosis, self-driving cars, digital assistants, advertising, search queries, and social recommendations. While there has been progress with deep learning, many of the big questions of intelligence have not been answered or properly formulated.


Identify the faces in the photo with a Multi-Task Convolutional Neural Network for Face Detection.

The Python package MTCNN implements Multi-task Cascaded Convolutional Neural Networks for face detection. It is based on TensorFlow and includes pre-trained weights. The detect_faces function returns the bounding box for each face and the positions of five landmarks: nose, right eye, left eye, left mouth corner, and right mouth corner.

import matplotlib.pyplot as plt
from mtcnn.mtcnn import MTCNN
import urllib.request

# download image as class.jpg
url = ''
urllib.request.urlretrieve(url, 'class.jpg')

def draw_faces(data, result_list):
    # show each detected face as a cropped subplot
    for i, result in enumerate(result_list):
        x, y, width, height = result['box']
        x1, y1 = max(x, 0), max(y, 0)     # keep the crop inside the image
        x2, y2 = x + width, y + height
        plt.subplot(1, len(result_list), i+1)
        plt.axis('off')
        plt.imshow(data[y1:y2, x1:x2])
    plt.show()

pixels = plt.imread('class.jpg')      # read image
detector = MTCNN()                    # create detector
faces = detector.detect_faces(pixels) # detect faces
draw_faces(pixels, faces)             # display faces

All but one of the student faces are detected, and each detected face is displayed as a crop of its bounding box.
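
Each entry returned by detect_faces is a dictionary with 'box', 'confidence', and 'keypoints' entries. The sketch below reads one such entry without running the detector. The sample values and the helper names box_corners and eye_distance are hypothetical and serve only to show the structure.

```python
import math

# Hypothetical detection in the detect_faces format; the numbers are made up.
face = {
    'box': [100, 50, 40, 60],             # x, y, width, height
    'confidence': 0.998,
    'keypoints': {
        'left_eye': (112, 70), 'right_eye': (130, 70), 'nose': (121, 82),
        'mouth_left': (114, 95), 'mouth_right': (128, 95),
    },
}

def box_corners(face):
    """Convert (x, y, width, height) to corners, clipping negative starts to 0."""
    x, y, w, h = face['box']
    return max(x, 0), max(y, 0), x + w, y + h

def eye_distance(face):
    """Pixel distance between the two eye keypoints."""
    (lx, ly) = face['keypoints']['left_eye']
    (rx, ry) = face['keypoints']['right_eye']
    return math.hypot(rx - lx, ry - ly)

print(box_corners(face))    # (100, 50, 140, 110)
print(eye_distance(face))   # 18.0
```

The clipping guards against a box that extends past the image edge, and the eye distance is one simple way to estimate the scale of a face.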

The confidence for each of the detected faces is also available.

for x in faces:
    print(x['confidence'])

Each of the detected faces has a confidence over 0.999. A threshold on the confidence can be adjusted to control the rate of false positives (Type-I errors) in face detection.
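
One way to apply such a threshold is to filter the detection list by confidence, as in the sketch below. The function name filter_faces, the default threshold of 0.9, and the sample detections are assumptions for illustration, not part of the MTCNN package.

```python
def filter_faces(faces, min_confidence=0.9):
    """Keep detections whose confidence meets the threshold."""
    return [f for f in faces if f['confidence'] >= min_confidence]

# Hypothetical detections in the detect_faces format; the values are made up.
detections = [
    {'box': [10, 20, 30, 40], 'confidence': 0.9996},
    {'box': [80, 25, 28, 38], 'confidence': 0.9991},
    {'box': [150, 90, 12, 15], 'confidence': 0.72},   # likely a false positive
]

for threshold in (0.5, 0.9, 0.9995):
    kept = filter_faces(detections, threshold)
    print(threshold, len(kept))
```

Raising the threshold removes more false positives but also risks discarding real faces (Type-II errors), so the value is a trade-off for each application.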


MediaPipe provides fast and accurate face detection with a pre-trained deep learning model and a perception pipeline. MediaPipe uses a two-step detector-tracker ML pipeline. The detector first locates the person/pose region-of-interest (ROI) within the frame. The tracker then predicts the pose landmarks and segmentation mask within the ROI, using the ROI-cropped frame as input. With video, the detector is invoked only as needed, such as for the first frame and when the tracker no longer identifies body pose presence. Other frames derive the ROI from the pose landmarks of the previous frame.
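
The control flow of a detector-tracker pipeline can be sketched with stand-in functions. This is not MediaPipe's implementation: run_pipeline, fake_detect, and fake_track are hypothetical names, and the frames are plain strings so the logic can be followed without a camera.

```python
def run_pipeline(frames, detect, track):
    """Call the detector only when no ROI carries over from the previous frame.

    detect(frame)     -> roi, or None if nothing is found
    track(frame, roi) -> (landmarks, next_roi), or (None, None) if tracking is lost
    """
    roi = None
    detector_calls = 0
    results = []
    for frame in frames:
        if roi is None:                     # first frame or tracking was lost
            roi = detect(frame)
            detector_calls += 1
            if roi is None:
                results.append(None)
                continue
        landmarks, roi = track(frame, roi)  # next ROI comes from the landmarks
        results.append(landmarks)
    return results, detector_calls

# Hypothetical stand-ins for the detector and tracker models.
def fake_detect(frame):
    return 'roi' if frame == 'face' else None

def fake_track(frame, roi):
    return ('landmarks', 'roi') if frame == 'face' else (None, None)

frames = ['face', 'face', 'empty', 'face', 'face']
results, detector_calls = run_pipeline(frames, fake_detect, fake_track)
print(results)
print(detector_calls)
```

In this run the detector is called twice for five frames: once at the start and once after tracking is lost on the empty frame.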

import cv2
import mediapipe as mp
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_face_mesh = mp.solutions.face_mesh

drawing_spec = mp_drawing.DrawingSpec(thickness=1, circle_radius=1)
cap = cv2.VideoCapture(0)
with mp_face_mesh.FaceMesh(max_num_faces=1,refine_landmarks=True,
    min_detection_confidence=0.5, min_tracking_confidence=0.5) as face_mesh:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            # If loading a video, use 'break' instead of 'continue'.
            continue

        # To improve performance, optionally mark the image as
        #   not writeable to pass by reference.
        image.flags.writeable = False
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = face_mesh.process(image)

        # Draw the face mesh annotations on the image.
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
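        if results.multi_face_landmarks:
            for lm in results.multi_face_landmarks:
                # Draw the tessellated mesh for each detected face.
                mp_drawing.draw_landmarks(
                    image=image,
                    landmark_list=lm,
                    connections=mp_face_mesh.FACEMESH_TESSELATION,
                    landmark_drawing_spec=None,
                    connection_drawing_spec=mp_drawing_styles
                    .get_default_face_mesh_tesselation_style())
        # Flip the image horizontally for a selfie-view display.
        cv2.imshow('MediaPipe Face Mesh', cv2.flip(image, 1))
        if cv2.waitKey(5) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()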

Additional Information

The 3Blue1Brown Channel has an informative YouTube series on neural networks and how they learn.

Adrian Rosebrock has prepared a step-by-step guide to getting started with Computer Vision, OpenCV, and Deep Learning.

There are several other packages for Face Detection. MediaPipe is a multi-platform package released by Google for static and video images with face, hands, and holistic body position detection.

✅ Knowledge Check

1. Which statement regarding Deep Learning in Computer Vision is accurate?

A. Deep learning uses a neural network and optimization to relate image features to a desired label.
Correct. Deep learning uses a neural network and optimization techniques to relate features from images, such as pixels, to a desired label or outcome. It is a popular approach in computer vision for extracting patterns from images.
B. Deep learning always requires preprocessing of the image to develop application-specific features.
Incorrect. Unlike Cascade Classifiers, deep learning does not need specialized preprocessing of the image to develop application-specific features.
C. Convolutional Neural Networks (CNNs) mainly function by using a large number of parameters.
Incorrect. With convolution in CNNs, the dependencies are local, translation is invariant, and there are fewer parameters.
D. Deep learning cannot be used to transform text into images.
Incorrect. Deep learning can transform text into images, for example with Generative AI.

2. What is the main function of the Python package MTCNN?

A. It is a package for handwriting transcription using deep learning.
Incorrect. MTCNN stands for Multi-task Cascaded Convolutional Neural Networks and is specifically used for Face Detection.
B. MTCNN is used to display bounding boxes around detected student faces.
Correct. MTCNN detects faces, and the provided Python code uses the package to display the bounding box of each detected face in the image.
C. MTCNN is primarily used for texture classification.
Incorrect. MTCNN is used for face detection and not for texture classification.
D. The primary purpose of MTCNN is to provide confidence scores for face detection and not the actual face detection.
Incorrect. While MTCNN does provide confidence scores, its primary purpose is face detection, and it identifies the bounding box for the face and the position of facial landmarks.

Additional computer vision case studies and information are available in Computer Vision Introduction, Cascade Classifier, Face Detection and Recognition, Texture Classification, Hand Tracking, Bit and Crack Image Classification, and Soil Classification.


Thanks to DJ Lee, BYU ECE Professor, for the computer vision material and for sharing research and industrial experience with the class.