Computer Vision Engineer @ EnVsion

I was a Computer Vision Engineer at EnVsion from June 2020 to July 2021

EnVsion develops a video productivity platform to make teams faster and more efficient. Specifically, on top of the usual editing functionality, they use AI to understand video content, enabling indexing and search operations far more powerful than what’s currently available.

As a member of the founding team of this early-stage company, beyond the technical aspects of my role described below, I also witnessed and participated in company structuring and strategic activities.

I wrote code to extract information from video with Computer Vision, Deep Learning, and Natural Language Processing (NLP). Code is usually in Python. Computer Vision applications, which range from object/people detection to OCR, typically use OpenCV, Tesseract, TensorFlow, and PyTorch. For NLP tasks, it’s common to use spaCy. Experiments frequently relied on Jupyter Notebooks. Models run on GPU whenever possible, via CUDA and cuDNN.
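A recurring pattern when applying these tools to video is to decode and analyze only a subset of frames, since running OCR or detection on every frame of a long video is wasteful. Below is a minimal sketch of that sampling arithmetic in plain Python; the function name and defaults are illustrative (the actual frame decoding would go through something like OpenCV's `cv2.VideoCapture`, not shown here):

```python
def ocr_frame_indices(total_frames: int, fps: float, sample_hz: float = 1.0) -> list[int]:
    """Return the indices of frames to analyze, sampled at roughly
    `sample_hz` frames per second of video.

    For example, a 30 fps video sampled at 1 Hz keeps every 30th frame.
    """
    step = max(1, round(fps / sample_hz))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, analyzed at one frame per second:
indices = ocr_frame_indices(total_frames=300, fps=30.0)
```

Each selected frame would then be handed to the detector or OCR engine, keeping GPU and API costs roughly proportional to video duration rather than frame count.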

On top of these ML activities, I also participated in the deployment of AI services using AWS’s tools that include:

  • ECR and ECS, for running Docker containers
  • SQS and SNS, for communicating between AWS’s components and EnVsion’s API
  • Lambda functions, for short tasks such as video transcoding
  • Transcribe, for extracting transcripts from video
  • S3, CloudWatch, and IAM roles, as you can’t get much done in AWS without them
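Since SQS and SNS carry the traffic between the AI services and EnVsion’s API, most of the glue code boils down to building and validating small JSON message bodies. The sketch below shows the general idea with a hypothetical schema; the field names (`video_id`, `s3_key`, `task`) are illustrative, not EnVsion’s actual message format:

```python
import json

def make_job_message(video_id: str, s3_key: str, task: str) -> str:
    """Serialize a processing request as it might travel over an SQS queue."""
    return json.dumps({"video_id": video_id, "s3_key": s3_key, "task": task})

def parse_job_message(body: str) -> dict:
    """Deserialize an incoming message body and check required fields."""
    msg = json.loads(body)
    missing = {"video_id", "s3_key", "task"} - msg.keys()
    if missing:
        raise ValueError(f"malformed message, missing fields: {missing}")
    return msg

# Round-trip example: the API enqueues a job, a worker parses it.
body = make_job_message("v123", "videos/v123.mp4", "ocr")
job = parse_job_message(body)
```

Keeping the schema validation in one place like this makes it easy for every service on the queue to fail fast on malformed messages instead of crashing mid-processing.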

Some examples of what I’ve done include:

  • Writing classes that make object detection easier by abstracting the underlying models, such as YOLO, Mask R-CNN, and SSD
  • Using and tweaking the implementation of Deep SORT for object tracking
  • Creating a simple and modular pipeline structure for plugging and unplugging deep learning functionalities
  • Creating a template class that allows an AI tool to listen to an AWS SQS queue for videos to process and post its results where they need to go. Almost all of EnVsion’s AI services extended this class
  • Using the Aeneas audio-alignment library to realign manually edited portions of an auto-generated transcript to the video’s audio track
  • Segmenting videos into separate shots using TransNetv2
  • Detecting celebrities using AWS Rekognition
  • Extracting text from presentation/class videos with Tesseract
  • Building multiple Deep Learning functionalities for road-related videos, such as a vehicle counter and a make/model identifier that cross-references license-plate data to detect fraud (EnVsion later shifted its focus to traffic videos)
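The queue-listening template class mentioned above is the backbone of this list: each AI service plugs its own processing step into a shared poll-process-post loop. The sketch below illustrates the pattern with plain Python; the real version would poll SQS via boto3, but here the queue and the result sink are abstracted as callables so the structure stands on its own (class and field names are hypothetical):

```python
from abc import ABC, abstractmethod

class QueueWorker(ABC):
    """Template for an AI service: pull jobs from a queue, process each,
    and post the result. Subclasses only implement `process`."""

    def __init__(self, receive, post):
        self._receive = receive  # () -> job dict, or None when the queue is empty
        self._post = post        # (result dict) -> None

    @abstractmethod
    def process(self, job: dict) -> dict:
        """The actual AI task (OCR, tracking, shot detection, ...)."""

    def run(self):
        # Poll until the queue is drained; a real service would long-poll SQS.
        while (job := self._receive()) is not None:
            self._post(self.process(job))

# A toy service extending the template: count shots from a list of cut points.
class ShotCounter(QueueWorker):
    def process(self, job):
        return {"video_id": job["video_id"], "shots": len(job["cuts"]) + 1}

jobs = [{"video_id": "v1", "cuts": [120, 480]}]
results = []
worker = ShotCounter(lambda: jobs.pop(0) if jobs else None, results.append)
worker.run()
```

Centralizing the loop this way means retries, logging, and message acknowledgment live in one base class, and each new service is reduced to a single `process` method.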

The team members all worked remotely. Communication and productivity tools included Git, Trello, and Slack.