With the advances in the multimodal generative capabilities of today's LLMs, the worlds of text, video, and audio are converging. Old photos can be reimagined, new images created from scratch, and videos generated from a single line of text; this is where we are in the world of generative AI. The application described here takes advantage of advances in computer vision and multimodal LLMs to produce a persona-based voiceover for a silent F1 race video.
The process schematic is depicted in Fig. 2 below. We start with an input video clip of a Formula 1 race, from which the audio has been manually removed. Using computer vision techniques, the video is split into individual frames. A subset of the frames is selected, preserving their order, and fed to OpenAI with a carefully crafted prompt to generate a voiceover script. The script is then passed to a text-to-speech API to generate a voiceover as an MP3 audio file.
In this section we break the process flow into the following steps.
Ensure you have a valid OpenAI API key, and establish the OpenAI connection before proceeding further.
Here we use OpenCV to process the video. We first read the video file and convert each frame into a base64-encoded JPEG image.
The JPEG-format images are stored in a list for further processing.
We also check the count of individual frames captured (i.e., the number of elements in the list).
We then loop through the list of JPEG images and display them one after another, each replacing the last, with a small time delay. This lets us check that the frames have been properly captured from the video; essentially, we are simulating a playback from the list of static frames.
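In a Jupyter notebook, this playback simulation could be sketched as below (the frame list and delay are assumptions; `clear_output(wait=True)` is what produces the "display with replacement" effect):

```python
import base64
import time
from IPython.display import Image, clear_output, display  # notebook display

def simulate_playback(base64_frames, delay=0.025):
    """Show each frame in place of the previous one, with a small delay."""
    for frame in base64_frames:
        clear_output(wait=True)  # replace the previously shown frame
        display(Image(data=base64.b64decode(frame)))
        time.sleep(delay)
```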
To generate a narrative, we craft a prompt and run inference via OpenAI, using the model "gpt-4o". When passing the content for the API call to "gpt-4o", we sample every 50th frame from our list of frames and pass this subset to OpenAI.
The prompt used is as below (refer to Appendix Fig A.1 for the code snippet):
"These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video."
The summary generated in response is shown below:
“Experience the adrenaline of Formula 1 racing with these thrilling highlights! Watch as the cars take on challenging curves and high-speed straights, showcasing the skill and precision of the world's top drivers. Every frame captures the intensity of the race, from heart-pounding overtakes to masterful manoeuvres. Don't miss out on this exhilarating ride through one of the most iconic circuits in the world! 🏎️💨 #F1 #Racing #Motorsport #AdrenalineRush”
In this step we create a voiceover for the video in the style of Ted Kravitz (a well-known F1 commentator).
We provide as input the same subset of video frames used in the previous step and prompt GPT to give us a short script.
We again call OpenAI's "gpt-4o" model via the API (refer to Appendix Fig A.2 for the code snippet).
The prompt used is as below:
“These are frames of a video. Create a short voiceover script in the style of Ted Kravitz. Only include the narration."
The voiceover generated in response is shown below:
“We're here at Spa, one of the most iconic tracks on the F1 calendar. As the Mercedes powers down towards Eau Rouge, you can feel the tension in the air. The speed is electrifying, with every corner demanding precision.
Now, watch as he closes in on the Ferrari ahead, setting up a thrilling battle for position. Down the main straight, it's wheel-to-wheel action! The crowd roars as they go neck-and-neck, the Mercedes just grabbing the edge with a brilliant move on the inside.
It's a masterclass in strategy and execution. This is why Spa always delivers an unforgettable race weekend!”
This is done as follows:
We pass the script to the TTS API where it will generate an mp3 of the voiceover:
The response is generated in chunks.
The response chunks are iterated over and concatenated into a single audio variable.
The audio is played back in the notebook using IPython's `Audio` display class.
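The steps above can be sketched with a streamed HTTP call to the OpenAI speech endpoint. The voice name `"onyx"` and the output file name are assumptions, and the chunk size is illustrative:

```python
import os
import requests

def script_to_speech(script, out_path="voiceover.mp3"):
    """POST the script to the TTS endpoint, concatenate the streamed
    chunks, and write the result out as an MP3 file."""
    api_key = os.environ.get("OPENAI_API_KEY", "")
    response = requests.post(
        "https://api.openai.com/v1/audio/speech",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": "tts-1", "voice": "onyx", "input": script},
        stream=True,
    )
    audio = b""
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        audio += chunk  # concatenate chunks into a single audio variable
    with open(out_path, "wb") as f:
        f.write(audio)
    return audio

# In a notebook, the result can then be played back with:
#   from IPython.display import Audio
#   Audio("voiceover.mp3")
```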
The generated narration explicitly identifies the Belgian track, Spa, which is remarkable because this is not stated anywhere in the video; the model is apparently able to infer it from the track layout itself.
Another noteworthy observation is that the narrative mentions a Mercedes overtaking a Ferrari. Given that the video was shot from above, where the cars' logos may not be visible to a viewer, the model appears to identify the cars from their shape and colour.
The current use case has the potential to be applied in other domains as follows.
In the sporting world more broadly, the approach may be adapted to generate audio tracks for events that have only a video recording, or that were recorded in one language and need to be translated into others.
The above concept may be extended to generating audio for movies or video recordings with poor or missing audio.
The multimodal approach may also be used in education, where text- or slide-based content can be supplemented with matching audio narration.
The code and the video file used in this project may be downloaded from the GitHub repository below.
The relevant code snippets are provided below:
Dr. Anish Roychowdhury
Anish is a Data Science Consultant and Educator with over 20 years of experience across industry and academia. He has taught in full-time and part-time roles at leading business and technology schools, including SP Jain School of Global Management, and is currently associated with Plaksha University, Mohali. He has also held leadership roles in multiple organizations. He holds a Ph.D. in computational microsystems from IISc Bangalore, with a Master's thesis in microfabrication (Louisiana State University, USA) and an undergraduate degree from NIT Durgapur. He has several peer-reviewed publications and award-winning conference presentations.
Dr. Pankaj Pansari
Pankaj is an active researcher in machine learning, computer vision, and mathematical optimization. He holds a PhD from Oxford University in discrete optimization and an undergraduate degree in Electrical Engineering from the Indian Institute of Technology, Kharagpur. He is currently a Visiting Faculty at Plaksha University. Previously he was a Research Scientist at Naver Labs Europe, working on solving hard discrete optimization problems using reinforcement learning, and a Data Analyst at General Electric, where he built statistical models to monitor the health of a fleet of steam turbines.