DAY 79-100 DAYS MLCODE: Object Detection and Segmentation in Video


January 28, 2019 · 100-Days-Of-ML-Code

In the previous blog, we discussed object detection and segmentation using Mask R-CNN for images; in this blog, we'll implement object detection and segmentation in video using Mask R-CNN.

You can find more detail about Mask R-CNN and how to use it in the previous blog. Let's set the batch size of the model to 5 so that the model will process 5 frames at a time.

Configure the batch size as shown below:

class InferenceConfig(coco.CocoConfig):
    # Set batch size to 5 so that inference runs on five frames
    # at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 5

config = InferenceConfig()
config.display()

The output of the above cell will look like this:
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 5
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.7
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 5
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 93
IMAGE_MIN_DIM 800
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 100
MEAN_PIXEL [123.7 116.8 103.9]
MINI_MASK_SHAPE (56, 56)
NAME coco
NUM_CLASSES 81
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.7
RPN_TRAIN_ANCHORS_PER_IMAGE 256
STEPS_PER_EPOCH 1000
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.0001

The output above confirms that the batch size is set to 5, so our model now processes 5 frames per feed, unlike image detection where we passed 1 image at a time.

Create Model and Load Trained Weights

# Create model object in inference mode.
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)

# Load weights trained on MS-COCO
model.load_weights(COCO_MODEL_PATH, by_name=True)
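
If the COCO weights file is not already on disk, the Matterport repo ships a helper to fetch it. A minimal sketch, assuming the Matterport Mask_RCNN package layout and that COCO_MODEL_PATH points at mask_rcnn_coco.h5 as in the previous blog:

import os
from mrcnn import utils

# download the pre-trained COCO weights if they are missing
# (COCO_MODEL_PATH is assumed to be defined, e.g. "mask_rcnn_coco.h5")
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)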

Class names:

Download the class names for the COCO dataset and store the values in the LABELS list.

!wget "https://raw.githubusercontent.com/nightrome/cocostuff/master/labels.txt"

# load the COCO class labels the Mask R-CNN model was trained on
labelsPath = os.path.sep.join([os.getcwd(), "labels.txt"])
LABELS = open(labelsPath).read().strip().split("\n")

Create a class_name list which we’ll pass to the model for predictions.

class_name = []
for data in LABELS:
    # each line looks like "0: unlabeled"; keep only the label text
    head, tail = data.split(":")
    class_name.append(tail.strip())

The class_name list will start with entries like ['unlabeled', 'person', 'bicycle', …].

Initialize the pointer to the video so that we can read it frame by frame during the final phase. The video_writer is required because we have to save the new video.

import cv2

VIDEO_SAVE_DIR = '/content/Mask_RCNN'
# initialize the video stream, the pointer to the output video file,
# and the frame dimensions
video = cv2.VideoCapture("Video.mp4")
video_writer = None
(Width, Height) = (None, None)
# total number of frames in the input video
prop = cv2.CAP_PROP_FRAME_COUNT
total = int(video.get(prop))
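
Optionally, you can also query the source frame rate so the output file plays back at the same speed (the writer in the loop below hard-codes 10 fps). A small sketch using OpenCV's frame-rate property:

# optional: read the source frame rate; fall back to 10 fps when the
# container does not report one (video.get can return 0)
fps = video.get(cv2.CAP_PROP_FPS) or 10
print("[INFO] {} frames at {:.1f} fps".format(total, fps))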

Now let's run the final object detection on the video.
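
Note that the loop below calls a display_instances helper that draws the detected boxes, masks, and labels directly onto a frame and returns it. The repo's visualize.display_instances renders with matplotlib instead of returning an image, so a helper along these lines is needed; here is a minimal OpenCV-based sketch (an illustration, not the exact helper used in this post):

import numpy as np
import cv2

def display_instances(image, boxes, masks, ids, names, scores):
    # draw every detected instance's mask, bounding box, and label
    # boxes: [N, (y1, x1, y2, x2)], masks: [H, W, N], ids/scores: [N]
    colors = np.random.randint(0, 255, (boxes.shape[0], 3))
    for i in range(boxes.shape[0]):
        if not np.any(boxes[i]):
            continue
        y1, x1, y2, x2 = [int(v) for v in boxes[i]]
        color = tuple(int(c) for c in colors[i])
        mask = masks[:, :, i]
        # blend the instance mask into the frame at 50% opacity
        image[mask] = (0.5 * image[mask] + 0.5 * np.array(color)).astype(np.uint8)
        label = '{} {:.2f}'.format(names[ids[i]], scores[i])
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)
        cv2.putText(image, label, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
    return image

With the helper in place, the main detection loop: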

total_frame_count = 0
frames = []
batch_size = 5  # must equal BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU

while True:
    # read the next frame from the file
    (found, img) = video.read()

    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not found:
        break

    total_frame_count += 1
    frames.append(img)

    if len(frames) == batch_size:
        predict = model.detect(frames, verbose=0)
        for i, item in enumerate(zip(frames, predict)):
            frame = item[0]
            r = item[1]
            frame = display_instances(
                frame, r['rois'], r['masks'], r['class_ids'], class_name, r['scores']
            )
            name = '{0}.jpg'.format(total_frame_count + i - batch_size)
            name = os.path.join(VIDEO_SAVE_DIR, name)
            # cv2.imwrite(name, frame)

            # initialize our video writer on the first processed frame
            if video_writer is None:
                fourcc = cv2.VideoWriter_fourcc(*"MJPG")
                video_writer = cv2.VideoWriter('output.avi', fourcc, 10,
                                               (img.shape[1], img.shape[0]), True)
            # write the output frame to disk
            video_writer.write(frame)

        # clear the frames list to start the next batch
        frames = []

# release the file pointers
print("[INFO] cleaning up...")
video_writer.release()
video.release()

Output: [INFO] cleaning up… This means the processing is complete. Let's view the output.
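
One caveat: model.detect in the Matterport implementation expects exactly BATCH_SIZE images per call, so any leftover frames at the end of the video (fewer than batch_size) are skipped by the loop above. A sketch of how those frames could be flushed, to run just before the release calls:

# flush leftover frames (fewer than a full batch) by padding the batch
# with copies of the last frame, then discarding the padded predictions;
# assumes video_writer was already initialized by the main loop
if frames:
    padded = frames + [frames[-1]] * (batch_size - len(frames))
    predict = model.detect(padded, verbose=0)
    for frame, r in zip(frames, predict):
        frame = display_instances(
            frame, r['rois'], r['masks'], r['class_ids'], class_name, r['scores']
        )
        video_writer.write(frame)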

Output video – Mask R-CNN

In conclusion, Mask R-CNN was able to perform both object detection and segmentation with ease, and with the help of the repo it was very easy to implement. You can find today's code here.