Computer Vision with Deep Learning
In computer vision, deep learning has proven effective for extracting patterns from images. Deep learning uses a neural network and optimization to relate features (pixels) to a desired label. Unlike Cascade Classifiers, deep learning does not require specialized preprocessing of the image to develop application-specific features. Deep learning can also transform text into images with AI Art Generators.
The pixels from the image are processed through multiple linear and nonlinear layers to predict an output. Deep learning generally requires many thousands of labeled examples to learn.
A Convolutional Neural Network (CNN) transforms the input image with a specialized connectivity structure that stacks multiple stages of feature extractors. Higher stages compute more global, invariant features, with a classification layer at the end. Feed-forward feature extraction convolves the input with learned filters, applies a nonlinearity (sigmoid, hyperbolic tangent, or rectified linear unit), performs spatial pooling, and finally normalizes to create a feature map. With convolution the dependencies are local, the features are translation invariant, and there are few parameters (filter weights and stride). Supervised training of the convolutional filters is performed by back-propagating the classification error.
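As an illustration of these stages, below is a minimal CNN sketch with TensorFlow/Keras. The input size, filter counts, and 10-class output are illustrative assumptions, not values from a specific application.

from tensorflow.keras import layers, models

# stack of feature extractors: convolution + ReLU, then spatial pooling
model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu',
                  input_shape=(32, 32, 3)),       # learned filters on 32x32 RGB
    layers.MaxPooling2D((2, 2)),                  # spatial pooling
    layers.Conv2D(32, (3, 3), activation='relu'), # more global features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')        # classification layer
])

# supervised training back-propagates the classification error
# to update the convolutional filter weights
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()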
Deep learning has made significant progress in face recognition, image classification, speech recognition, text-to-speech generation, handwriting transcription, medical diagnosis, self-driving cars, digital assistants, advertising, search queries, and social recommendations. While there has been progress with deep learning, many of the big questions of intelligence have not been answered or properly formulated.
Activity
Identify the faces in the photo with a Multi-task Cascaded Convolutional Neural Network for Face Detection.
The Python package MTCNN implements Multi-task Cascaded Convolutional Neural Networks for Face Detection. It is based on TensorFlow with pre-trained weights. The detect_faces function identifies the bounding box for each face and the positions of the nose, right eye, left eye, left mouth corner, and right mouth corner.
from mtcnn.mtcnn import MTCNN
import matplotlib.pyplot as plt
import urllib.request

# download image as class.jpg
url = 'http://apmonitor.com/pds/uploads/Main/students_walking.jpg'
urllib.request.urlretrieve(url, 'class.jpg')

# display each detected face in a separate subplot
def draw_faces(data, result_list):
    for i in range(len(result_list)):
        x1, y1, width, height = result_list[i]['box']
        x2, y2 = x1 + width, y1 + height
        plt.subplot(1, len(result_list), i+1)
        plt.axis('off')
        plt.imshow(data[y1:y2, x1:x2])
    plt.show()

pixels = plt.imread('class.jpg')       # read image
detector = MTCNN()                     # create detector
faces = detector.detect_faces(pixels)  # detect faces
draw_faces(pixels, faces)              # display faces
All but one of the student faces are recognized and the bounding boxes are displayed.
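Each element of the faces list is a dictionary with the bounding box, confidence, and facial keypoints. A short sketch to inspect the first detection, assuming the keypoint names used by the mtcnn package:

print(faces[0]['box'])  # bounding box as [x, y, width, height]
for name, (x, y) in faces[0]['keypoints'].items():
    print(name, x, y)   # e.g. nose, left_eye, mouth_right positions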
The confidence for each of the detected faces is also available.
for face in faces:
    print(face['confidence'])
Each detected face has a confidence over 0.999. The confidence threshold can be adjusted to control the rate of false positives (Type-I errors) in face detection.
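A minimal sketch of thresholding on confidence to reduce false positives; the 0.999 cutoff is an illustrative assumption:

threshold = 0.999   # raise to reduce false positives
confident = [f for f in faces if f['confidence'] > threshold]
print(len(confident), 'of', len(faces), 'faces above', threshold)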
MediaPipe
MediaPipe provides fast and accurate face detection with a pre-trained Deep Learning model and a perception pipeline. MediaPipe uses a two-step detector-tracker ML pipeline: the detector first locates the person/pose region-of-interest (ROI) within the frame, and the tracker then predicts the pose landmarks and segmentation mask within the ROI using the ROI-cropped frame as input. For video, the detector is invoked only as needed, such as for the first frame or when the tracker no longer identifies body pose presence; other frames derive the ROI from the pose landmarks of the previous frame.
import cv2
import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_face_mesh = mp.solutions.face_mesh

drawing_spec = mp_drawing.DrawingSpec(thickness=1, circle_radius=1)
cap = cv2.VideoCapture(0)  # open webcam
with mp_face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True,
                           min_detection_confidence=0.5,
                           min_tracking_confidence=0.5) as face_mesh:
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Ignoring empty camera frame.")
            # If loading a video, use 'break' instead of 'continue'.
            continue
        # To improve performance, optionally mark the image as
        # not writeable to pass by reference.
        image.flags.writeable = False
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        results = face_mesh.process(image)
        # Draw the face mesh annotations on the image.
        image.flags.writeable = True
        image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
        if results.multi_face_landmarks:
            for lm in results.multi_face_landmarks:
                mp_drawing.draw_landmarks(image=image, landmark_list=lm,
                    connections=mp_face_mesh.FACEMESH_TESSELATION,
                    landmark_drawing_spec=None,
                    connection_drawing_spec=mp_drawing_styles
                    .get_default_face_mesh_tesselation_style())
                mp_drawing.draw_landmarks(image=image, landmark_list=lm,
                    connections=mp_face_mesh.FACEMESH_CONTOURS,
                    landmark_drawing_spec=None,
                    connection_drawing_spec=mp_drawing_styles
                    .get_default_face_mesh_contours_style())
                mp_drawing.draw_landmarks(image=image, landmark_list=lm,
                    connections=mp_face_mesh.FACEMESH_IRISES,
                    landmark_drawing_spec=None,
                    connection_drawing_spec=mp_drawing_styles
                    .get_default_face_mesh_iris_connections_style())
        # Flip the image horizontally for a selfie-view display.
        cv2.imshow('MediaPipe Face Mesh', cv2.flip(image, 1))
        if cv2.waitKey(5) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()
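For a single static image, the detector step can also be used on its own. Below is a minimal sketch with the MediaPipe Face Detection solution; the class.jpg file from the earlier activity and the model/confidence settings are assumptions.

import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection
mp_drawing = mp.solutions.drawing_utils

image = cv2.imread('class.jpg')  # image from the earlier activity
with mp_face_detection.FaceDetection(model_selection=1,
                                     min_detection_confidence=0.5) as fd:
    # MediaPipe expects RGB input; OpenCV loads images as BGR
    results = fd.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.detections:
    for detection in results.detections:
        mp_drawing.draw_detection(image, detection)
cv2.imwrite('class_detected.jpg', image)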
Additional Information
The 3Blue1Brown Channel has an informative YouTube series on neural networks and how they learn.
- But what is a neural network?
- Gradient descent, how neural networks learn
- What is backpropagation really doing?
Adrian Rosebrock has prepared a step-by-step guide to getting started with Computer Vision, OpenCV, and Deep Learning.
There are several other packages for Face Detection. MediaPipe is a multi-platform package released by Google for face, hand, and holistic body position detection in static images and video.
- Taigman, Y., Yang, M., Ranzato, M., Wolf, L., DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014.
- Fridman, L., Deep Learning Lectures at MIT, 2020.
- Gradilla, R., Multi-task Cascaded Convolutional Networks (MTCNN) for Face Detection and Facial Landmark Alignment, Medium, Published Jul 27, 2020.
✅ Knowledge Check
1. Which statement regarding Deep Learning in Computer Vision is accurate?
- Correct. Deep learning uses a neural network and optimization techniques to relate features from images, such as pixels, to a desired label or outcome. It is a popular approach in computer vision for extracting patterns from images.
- Incorrect. Unlike Cascade Classifiers, deep learning does not need specialized preprocessing of the image to develop application-specific features.
- Incorrect. With convolution in CNNs, the dependencies are local, translation is invariant, and there are fewer parameters.
- Incorrect. Deep learning can transform text into images, for example with Generative AI.
2. What is the main function of the Python package MTCNN?
- Incorrect. MTCNN stands for Multi-task Cascaded Convolutional Neural Networks and is specifically used for Face Detection.
- Correct. MTCNN detects faces and the provided Python code with the package is used to draw bounding boxes around the detected faces in the image.
- Incorrect. MTCNN is used for face detection and not for texture classification.
- Incorrect. While MTCNN does provide confidence scores, its primary purpose is face detection, and it identifies the bounding box for the face and the position of facial landmarks.
Additional computer vision case studies and information are available in Computer Vision Introduction, Cascade Classifier, Face Detection and Recognition, Texture Classification, Hand Tracking, Bit and Crack Image Classification, and Soil Classification.
Acknowledgement
Thanks to DJ Lee, BYU ECE Professor, for the computer vision material and for sharing research and industrial experience with the class.