YouTube Video Summarization with Python

Introduction

I don't know whether you do the same, but whenever I have free time I tend to lose hours watching all sorts of videos on YouTube, with titles like "7 secrets to success," "the 10 most useful machine learning tools," or even "the 5 most beautiful places in London."

Instead of just telling you what you want to know, the presenter launches into an interminable monologue as soon as the video starts, padding it out to make it longer and attract more views.

Occasionally, though, as if by magic, you come across a saint in the comments who summarises the video and lists its key points, so you don't have to waste thirty minutes (or fifteen minutes at double speed) watching it!

So one day I had the idea: "since I'm good with machine learning, couldn't I just have these videos summarised automatically?"

In this article, I describe my attempt to build a small Python program that works, if slightly imperfectly.

Download the Audio from YouTube

First, we need to get the content from YouTube. We only need the audio track, not the whole video, so we download just the audio stream.

We install the pytube library with pip and use the following code to fetch the audio from YouTube.

 
!pip install pytube -q 
from pytube import YouTube
# Specify the YouTube video URL
VIDEO_URL = 'https://www.youtube.com/watch?v=h-JVjs9AAmQ' # Example video
# Download only the audio stream as an mp4 file
yt = YouTube(VIDEO_URL)
yt.streams.filter(only_audio=True, file_extension='mp4').first().download(filename='ytaudio.mp4')

Explanation:

This script uses the pytube library to download the audio track of a given YouTube video as an MP4 file. It first imports the YouTube class from pytube and specifies the video's URL. The line yt.streams.filter(only_audio=True, file_extension='mp4').first().download(filename='ytaudio.mp4') then narrows the available streams to audio-only MP4 streams, takes the first match, and saves it as ytaudio.mp4.

Convert MP4 to WAV and Check the Audio

Was the audio file downloaded correctly? Let's convert it to WAV and check its sample rate straight from the notebook.

 
# Convert the downloaded audio file from mp4 to wav format using ffmpeg
!ffmpeg -i ytaudio.mp4 -acodec pcm_s16le -ar 16000 ytaudio.wav
# Check the audio sample rate using librosa
import librosa
input_file = 'ytaudio.wav'
print(librosa.get_samplerate(input_file))   

Explanation:

This script uses ffmpeg to convert the MP4 audio file to WAV, using the pcm_s16le codec (16-bit PCM) and a 16 kHz sample rate. It then uses librosa to read the sample rate of the converted file and confirm it. The conversion ensures the audio is compatible with tools that expect WAV input at that sample rate.
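
As a cross-check that does not depend on librosa, the properties of the converted file can also be read with Python's standard wave module. This is only a minimal sketch: the helper names describe_wav and is_16khz_16bit are my own and not part of any library, and it assumes ffmpeg wrote a plain PCM WAV file, which is what the pcm_s16le option requests.

```python
import wave

def describe_wav(path):
    """Return (channels, sample_width_in_bytes, sample_rate) of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return (wav_file.getnchannels(),
                wav_file.getsampwidth(),
                wav_file.getframerate())

def is_16khz_16bit(path):
    """True if the file holds 16-bit samples (2 bytes each) at 16 kHz."""
    _, sample_width, sample_rate = describe_wav(path)
    return sample_width == 2 and sample_rate == 16000

# Hypothetical usage on the file produced above:
# print(describe_wav('ytaudio.wav'))
# print(is_16khz_16bit('ytaudio.wav'))
```

If is_16khz_16bit returns False, the ffmpeg command probably did not run as intended and is worth re-checking before moving on.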

Audio to Text

Next, the audio recording must be converted to text, ideally by a model with a low word error rate. The text can then be fed directly to an NLP model for summarisation.

More information on the model we'll use to convert speech to text can be found here.

 
!pip install huggingsound -q
import librosa
import soundfile as sf
import torch
from huggingsound import SpeechRecognitionModel
# Set the device to GPU if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize the speech recognition model
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english", device=device)
# Stream over 30-second chunks rather than load the full file
input_file = 'ytaudio.wav'
stream = librosa.stream(input_file, block_length=30, frame_length=16000, hop_length=16000)
# Save each chunk as a separate wav file and remember its path
# (the stream is a generator, so it has no len() and can only be consumed once)
audio_paths = []
for i, speech in enumerate(stream):
    path = f'{i}.wav'
    sf.write(path, speech, 16000)
    audio_paths.append(path)
# Transcribe each chunk
transcriptions = model.transcribe(audio_paths)
# Combine the transcriptions into a single text
full_transcript = ' '.join([item['transcription'] for item in transcriptions])

Explanation:

This code uses the huggingsound library with a pre-trained Wav2Vec2 model to perform speech-to-text transcription. It first selects the compute device (GPU if available, otherwise CPU) and then initialises the speech recognition model. Using librosa, the audio file is read in 30-second chunks, and each chunk is saved as a separate WAV file. The model then transcribes every chunk, and the transcriptions are joined into a single string that forms the complete transcript of the audio file.
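
To make the chunking arithmetic concrete, here is a small sketch in plain Python that computes where each 30-second chunk starts and ends, in samples. It is an illustration of the idea only, not how librosa is implemented, and chunk_bounds is a hypothetical helper name. At 16 kHz, one chunk is 16000 × 30 = 480000 samples, and the last chunk is simply whatever remains.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices for consecutive fixed-length chunks.

    The final chunk may be shorter than the others.
    """
    chunk_size = sample_rate * chunk_seconds
    return [(start, min(start + chunk_size, num_samples))
            for start in range(0, num_samples, chunk_size)]

# A 70-second recording at 16 kHz: two full chunks plus a 10-second tail
bounds = chunk_bounds(70 * 16000)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Note that cutting at fixed positions can split a word between two chunks, which is one reason the transcript may contain small errors at chunk boundaries.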

Text Summarization

The only thing left to do is to summarise the text that we extracted from the video.

On the Hugging Face model hub, simply filter by the summarisation task to pick, out of the hundreds of models available, the one that best fits your situation.

I'll be using the google/pegasus-xsum model for this project. The model's details are available here; I'll also discuss the theory behind these summarisation methods in upcoming articles.

These pre-trained models from Hugging Face are really easy to use; just have a look at how I run summarisation in a few lines of code.

 
from transformers import pipeline
# Initialize the summarization model
summarizer = pipeline("summarization", model="google/pegasus-xsum")
# Summarize the text in chunks of 1000 characters
# (stepping by chunk_size avoids an empty final chunk when the length is an exact multiple)
chunk_size = 1000
summarized_text = []
for start in range(0, len(full_transcript), chunk_size):
    chunk = full_transcript[start:start + chunk_size]
    summary_chunk = summarizer(chunk, min_length=5, max_length=20)
    summarized_text.append(summary_chunk[0]['summary_text'])
# Combine the summarized chunks
final_summary = ' '.join(summarized_text)
print(final_summary)   

Explanation:

This code splits a long text into 1000-character pieces and summarises each one with the google/pegasus-xsum model. It then joins the individual summaries into a final, condensed version of the source text.
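
One of the flaws of slicing every 1000 characters is that a chunk can begin or end in the middle of a word. A possible refinement, which is my own suggestion and not part of the program above, is to split on whitespace with the standard textwrap module. The helper name split_on_whitespace is hypothetical, and the sketch assumes that collapsing runs of whitespace in the transcript is acceptable.

```python
import textwrap

def split_on_whitespace(text, max_chars=1000):
    """Split text into chunks of at most max_chars, breaking only at whitespace.

    A single word longer than max_chars is kept whole rather than cut.
    """
    return textwrap.wrap(text, width=max_chars,
                         break_long_words=False, break_on_hyphens=False)

# Hypothetical usage with the summarizer defined earlier:
# for chunk in split_on_whitespace(full_transcript):
#     summary_chunk = summarizer(chunk, min_length=5, max_length=20)
```

The chunks are still bounded by character count rather than by the model's token limit, so this only removes the broken words; it does not guarantee that every chunk fits the model.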
