Computer Vision Engineer @ EnVsion

I was a Computer Vision Engineer at EnVsion from June 2020 to July 2021

EnVsion develops a video productivity platform to make teams faster and more efficient. Specifically, on top of the usual editing functionality, they use AI to understand video content, enabling indexing and search operations far more powerful than what’s currently available.

As a member of the founding team of this early-stage company, beyond the technical aspects of my role described below, I also witnessed and participated in company structuring and strategic activities.

I wrote code to extract information from video with Computer Vision, Deep Learning, and Natural Language Processing (NLP). Code is usually in Python. Computer Vision applications, which range from object/people detection to OCR, typically use OpenCV, Tesseract, TensorFlow, and PyTorch. For NLP tasks, it’s common to use spaCy. Experiments frequently relied on Jupyter Notebooks. Models run on GPU whenever possible, via CUDA and cuDNN.
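A recurring pattern when applying these tools to video is to decode and analyze only a subset of frames, since running OCR or detection on every frame of a long video is wasteful. Below is a minimal sketch of that sampling arithmetic in plain Python; the function name and defaults are illustrative (the actual frame decoding would go through something like OpenCV's `cv2.VideoCapture`, not shown here):

```python
def ocr_frame_indices(total_frames: int, fps: float, sample_hz: float = 1.0) -> list[int]:
    """Return the indices of frames to analyze, sampled at roughly
    `sample_hz` frames per second of video.

    For example, a 30 fps video sampled at 1 Hz keeps every 30th frame.
    """
    step = max(1, round(fps / sample_hz))
    return list(range(0, total_frames, step))

# A 10-second clip at 30 fps, analyzed at one frame per second:
indices = ocr_frame_indices(total_frames=300, fps=30.0)
```

Each selected frame would then be handed to the detector or OCR engine, keeping GPU and API costs roughly proportional to video duration rather than frame count.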

On top of these ML activities, I also participated in the deployment of AI services using AWS’s tools that include:

  • ECR and ECS, for running Docker containers
  • SQS and SNS, for communicating between AWS’s components and EnVsion’s API
  • Lambda functions, for short tasks such as video transcoding
  • Transcribe, for extracting transcripts from video
  • S3, CloudWatch, and IAM roles, as you can’t get much done in AWS without them
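Since SQS and SNS carry the traffic between the AI services and EnVsion’s API, most of the glue code boils down to building and validating small JSON message bodies. The sketch below shows the general idea with a hypothetical schema; the field names (`video_id`, `s3_key`, `task`) are illustrative, not EnVsion’s actual message format:

```python
import json

def make_job_message(video_id: str, s3_key: str, task: str) -> str:
    """Serialize a processing request as it might travel over an SQS queue."""
    return json.dumps({"video_id": video_id, "s3_key": s3_key, "task": task})

def parse_job_message(body: str) -> dict:
    """Deserialize an incoming message body and check required fields."""
    msg = json.loads(body)
    missing = {"video_id", "s3_key", "task"} - msg.keys()
    if missing:
        raise ValueError(f"malformed message, missing fields: {missing}")
    return msg

# Round-trip example: the API enqueues a job, a worker parses it.
body = make_job_message("v123", "videos/v123.mp4", "ocr")
job = parse_job_message(body)
```

Keeping the schema validation in one place like this makes it easy for every service on the queue to fail fast on malformed messages instead of crashing mid-processing.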

Some examples of what I’ve done include:

  • Writing classes that make object detection easier by abstracting the underlying models, such as YOLO, Mask R-CNN, and SSD
  • Using and tweaking the implementation of Deep SORT for object tracking
  • Creating a simple and modular pipeline structure for plugging and unplugging deep learning functionalities
  • Creating a template class that allows an AI tool to listen to an AWS SQS queue for videos to process and post its results where they need to go. Almost all of EnVsion’s AI services extended this class
  • Using the Aeneas audio-alignment library to realign manually edited portions of an auto-generated transcript to the video’s audio track
  • Segmenting videos into separate shots using TransNetv2
  • Detecting celebrities using AWS Rekognition
  • Extracting text from presentation/class videos with Tesseract
  • Building multiple Deep Learning functionalities for road-related videos, such as a vehicle counter and a make/model identifier that cross-references license-plate data to detect fraud (EnVsion later shifted its focus to traffic videos)
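The queue-listening template class mentioned above is the backbone of this list: each AI service plugs its own processing step into a shared poll-process-post loop. The sketch below illustrates the pattern with plain Python; the real version would poll SQS via boto3, but here the queue and the result sink are abstracted as callables so the structure stands on its own (class and field names are hypothetical):

```python
from abc import ABC, abstractmethod

class QueueWorker(ABC):
    """Template for an AI service: pull jobs from a queue, process each,
    and post the result. Subclasses only implement `process`."""

    def __init__(self, receive, post):
        self._receive = receive  # () -> job dict, or None when the queue is empty
        self._post = post        # (result dict) -> None

    @abstractmethod
    def process(self, job: dict) -> dict:
        """The actual AI task (OCR, tracking, shot detection, ...)."""

    def run(self):
        # Poll until the queue is drained; a real service would long-poll SQS.
        while (job := self._receive()) is not None:
            self._post(self.process(job))

# A toy service extending the template: count shots from a list of cut points.
class ShotCounter(QueueWorker):
    def process(self, job):
        return {"video_id": job["video_id"], "shots": len(job["cuts"]) + 1}

jobs = [{"video_id": "v1", "cuts": [120, 480]}]
results = []
worker = ShotCounter(lambda: jobs.pop(0) if jobs else None, results.append)
worker.run()
```

Centralizing the loop this way means retries, logging, and message acknowledgment live in one base class, and each new service is reduced to a single `process` method.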

The team members all worked remotely. Communication and productivity tools included Git, Trello, and Slack.