How to transcribe audio with OpenAI's Whisper API in Google Colab (Python)

Transcribing audio can be a game-changer for content creators, researchers, and anyone needing accurate text from spoken words. With OpenAI’s Whisper API, the process is quick and accurate. I’ve explored various transcription tools, and Whisper stands out for its ease of use and its capabilities, particularly in capturing punctuation and handling mixed-language audio.

In this guide, I’ll walk you through how to transcribe audio using OpenAI’s Whisper API. Whether you’re new to transcription or looking to streamline your workflow, this tutorial will provide clear, actionable steps to get you started. Let’s dive into unlocking the potential of your audio content with Whisper’s advanced technology.

How do speech-to-text models work

Audio transcription converts spoken words into written text. The process relies on deep neural networks, sometimes combined with NLP approaches such as entity recognition and part-of-speech (POS) tagging, to identify speech patterns and turn them into text accurately. By combining machine learning techniques for audio processing and natural language processing, transcription models like OpenAI’s Whisper can handle a range of accents and languages, making transcription more accessible and efficient.

The first step in audio transcription is speech recognition. An audio transcription API uses deep learning models to analyze audio files and recognize spoken words. These models are trained on diverse datasets to improve accuracy in real-world scenarios. Many models also include audio processing steps, for example to detect pauses or reduce background noise.

Following speech recognition, the next step is text processing. The recognised speech is processed to ensure readability and coherence. The API applies natural language processing (NLP) techniques to handle punctuation, casing, grammar, and context, so that the transcribed text matches the intent and meaning of the original audio.

The final step in audio transcription is output formatting. Here, the transcribed text is formatted into a readable document. Modern speech-to-text APIs support various output formats, and come with punctuation out of the box.
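To make the output formatting step more concrete, here is a minimal sketch that turns timestamped segments into SubRip (SRT) subtitle text. The segment structure (a list of dicts with `start`, `end`, and `text` keys) and the helper names `format_timestamp` and `segments_to_srt` are assumptions chosen for illustration, and the example segments are made up.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'text'."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle blocks
    return "\n".join(blocks)


# Hypothetical segments, made up for this example
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.04, "text": " Today we talk about transcription."},
]
print(segments_to_srt(example_segments))
```

In practice you rarely need to do this by hand: the Whisper API accepts a `response_format` parameter (for example `"srt"` or `"vtt"`) and returns the formatted output directly.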

About the model: OpenAI’s Whisper API

OpenAI’s Whisper API utilises state-of-the-art technology for audio transcription. This model ensures high accuracy and multilingual support, making it a robust tool for various transcription needs.

How It Works

OpenAI’s Whisper API gives users access to Whisper, the company’s state-of-the-art open-source speech-to-text model, in its large-v2 version. Trained on 680,000 hours of diverse, multilingual, and multitask data from the web, Whisper excels in transcribing audio in up to 60 languages, including English, Chinese, Arabic, Hindi, and French. It delivers high accuracy and performance, which makes it well suited to applications in transcription services, language translation, and real-time communication.

The Audio API provides two speech-to-text endpoints, which can be used for:

Transcription of audio into whatever language the audio is in.
Translation and transcription of the audio into English.
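The two endpoints can be sketched side by side as follows. This is a minimal illustration, assuming a `client` created with `OpenAI()` from the `openai` package; the function names and the file name in the usage comments are hypothetical.

```python
def transcribe(client, file_path):
    """Return text in the language spoken in the audio."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text


def translate_to_english(client, file_path):
    """Return an English translation of the speech in the audio."""
    with open(file_path, "rb") as audio_file:
        result = client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text


# Usage in Colab, assuming `client = OpenAI()` and an uploaded file
# (the file name is a made-up example):
# print(transcribe(client, "interview_fr.mp3"))
# print(translate_to_english(client, "interview_fr.mp3"))
```

Passing the client in as an argument keeps both helpers easy to test and reuse across notebooks.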

The API can take uploads of files up to 25 MB in one of the following file types: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
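A quick local check before uploading can save a failed request. The sketch below tests a file against the size limit and file types listed above; the helper name `check_audio_file` is an assumption, and so is the reading of 25 MB as 25 × 1024 × 1024 bytes.

```python
import os

# Assumption: 25 MB interpreted as 25 * 1024 * 1024 bytes
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_audio_file(file_path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = os.path.splitext(file_path)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if not os.path.isfile(file_path):
        problems.append("file not found")
    elif os.path.getsize(file_path) > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 25 MB")
    return problems
```

The check only looks at the file extension, not the actual audio encoding, so treat an empty result as "probably fine" rather than a guarantee.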

Additional Resources

The OpenAI API documentation offers detailed guides on integrating the various APIs and models effectively. It includes code examples, usage tips, and troubleshooting information.

There have also been some updates to how the calls to the Whisper API (and other OpenAI models) are made. You can review all of the changes here.

Step-by-step guide on using OpenAI’s Whisper API for Audio Transcription in Google Colab (Python)

Easily transcribe audio using OpenAI’s Whisper API in Google Colab with this guide. Follow the detailed steps for single and multiple file transcription.

Prerequisites

Google Account: Ensure you have a Google account for accessing Google Colab.
Google Colab: Get familiar with Google Colab as it will be the platform for executing Python code.
OpenAI API Key: Obtain an API key from OpenAI. Sign up on OpenAI’s website and navigate to the API section to generate your unique key.
Credits in your API account: Add credits to your account before you start, as API requests will not run without a sufficient balance.
Basic Python Knowledge: Have a fundamental understanding of Python programming to follow the code snippets and their execution.

Using the Google Colab Template for Single File Transcription

Initial Setup: Open a new notebook in Google Colab (or make a copy of our template). Ensure the runtime is set to Python.
Install Required Libraries:

!pip install openai

Authenticate API Key:

import os
from openai import OpenAI

# Set your OpenAI API key here
OPENAI_API_KEY = 'your_openai_api_key'
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

# The client reads OPENAI_API_KEY from the environment by default
client = OpenAI()
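Hard-coding the key in a notebook cell makes it easy to share by accident. One alternative, sketched below under the assumption that you are happy to paste the key once per session, is to read it from the environment and fall back to a hidden prompt. The helper name `load_api_key` is hypothetical.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the input, so the key is not stored in the notebook output
        key = getpass(f"Enter your {env_var}: ")
        os.environ[env_var] = key  # OpenAI() reads this variable by default
    return key


# Usage in the notebook:
# load_api_key()
# client = OpenAI()
```

Colab's Secrets panel is another option for storing the key outside the notebook itself.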

Upload Audio File: Use Google Colab’s file upload feature to upload the audio you want to transcribe.

from google.colab import files

# Upload audio file
uploaded = files.upload()

file_path = next(iter(uploaded))

Define the transcription function:

def transcribe_audio(file_path):
    with open(file_path, 'rb') as audio_file:
        response = client.audio.transcriptions.create(
            model='whisper-1',
            file=audio_file
        )
    return response

Call the function: Transcribe the audio file.

# Transcribe the uploaded audio file and print the text
transcription = transcribe_audio(file_path)
print(transcription.text)

Using the Google Colab Template for Transcription of multiple files (in Bulk)

Complete the first three steps above (initial setup, library installation, and API key authentication).

For easier access, make a copy of the Google Colab template.

The script processes multiple uploaded audio files by iterating through each file, transcribing the audio content using the OpenAI Whisper model, and saving the transcription to a text file.

For each audio file, the corresponding text file is created with the same base name and a “.txt” extension. After saving the transcription, the script provides a download link for the text file, allowing users to download each transcription to their local machine. This ensures that each audio file is individually transcribed and easily accessible in text format.

def transcribe_audio(file_path):
    with open(file_path, 'rb') as audio_file:
        response = client.audio.transcriptions.create(
            model='whisper-1',
            file=audio_file
        )
    return response.text

from google.colab import files

# Upload multiple audio files
uploaded = files.upload()

# Iterate over the uploaded files and transcribe each one
for file_name in uploaded:
    transcription = transcribe_audio(file_name)

    # Write the transcription to a text file with the same base name
    output_file_name = f"{os.path.splitext(file_name)[0]}.txt"
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(transcription)

    print(f"Transcription saved to {output_file_name}")

    # Download the text file to your local machine
    files.download(output_file_name)
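When processing many files, a single failed request would otherwise stop the whole loop. A minimal retry wrapper with exponential backoff is sketched below; the name `with_retries`, the default of three attempts, and the one-second base delay are all assumptions for illustration.

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception as error:  # narrow this to API/network errors in practice
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Usage inside the loop above:
# transcription = with_retries(transcribe_audio, file_name)
```

The `sleep` parameter exists only so the wait can be swapped out when testing; in normal use you can ignore it.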

For greater scalability, you can modify the script to pull audio files directly from Google Drive cloud storage or integrate with Google Cloud Storage and BigQuery for handling larger datasets and performing advanced data analysis.
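As a starting point for the Google Drive variant, the sketch below collects audio files from a folder instead of relying on manual uploads. The helper name `find_audio_files` and the Drive folder path in the usage comments are hypothetical; the extension list mirrors the file types the API accepts.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def find_audio_files(folder):
    """Return audio files under `folder` (recursively), sorted by path."""
    return sorted(
        path for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS
    )


# In Colab (the folder name is a hypothetical example):
# from google.colab import drive
# drive.mount('/content/drive')
# for path in find_audio_files('/content/drive/MyDrive/audio'):
#     print(path.name, transcribe_audio(str(path))[:80])
```

Mounting Drive asks for authorisation the first time it runs in a session, so expect a prompt before the loop starts.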

Benefits and Limitations of audio transcription with OpenAI’s Whisper API in Python

Below are the benefits of the OpenAI Whisper API for speech-to-text (audio transcription):

High Accuracy: The Whisper API identifies words precisely and also handles pauses, punctuation, and the language spoken, even when languages are mixed within the same audio file. This accuracy translates into reliable text output with fewer errors to correct.

Multilingual Support: OpenAI’s Whisper API supports multiple languages, making it versatile for users who need to transcribe audio in various languages. This feature broadens the scope of its application globally.

Scalability: The provided script facilitates the transcription of both single and multiple audio files efficiently.

Let’s discuss some of the limitations of the OpenAI Whisper API:

Scalability: The 25 MB limit on file uploads can be a constraint for large-scale projects.
Privacy Concerns: Uploading audio files to an external server can raise privacy issues, particularly for sensitive or confidential information. Ensuring data security requires additional measures.
Cost structure and set-up: Using the API requires credits in your account. In practice, you need to make sure the balance is sufficient before starting a speech-to-text job, so that processing is not interrupted midway. Otherwise, costs are accessible for both one-off and enterprise projects.
Dependency on Internet Connectivity: The transcription process relies on internet connectivity. Poor or unstable connections can interrupt the process, leading to delays or incomplete transcriptions.
Need for quality control: As impressive as the API is, the output still needs quality control and editing, especially when transcribing languages other than English.
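One common workaround for the upload size limit is to split long recordings into shorter pieces and transcribe them one by one. The sketch below only plans the chunk boundaries; the actual cutting would need an audio tool or library, which is not shown here. The function name `plan_chunks` and the 600-second default are assumptions for illustration.

```python
def plan_chunks(duration_seconds, chunk_seconds=600):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    chunks = []
    start = 0
    while start < duration_seconds:
        # The last chunk is shortened so it ends exactly at the recording's end
        end = min(start + chunk_seconds, duration_seconds)
        chunks.append((start, end))
        start = end
    return chunks


print(plan_chunks(1500))
# [(0, 600), (600, 1200), (1200, 1500)]
```

Cutting at fixed intervals can split a sentence in half, so where possible choose boundaries that fall on pauses and review the text around each join.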

How to use audio transcription (speech-to-text) in SEO projects

Leveraging speech-to-text models for audio transcription opens up numerous opportunities to enhance SEO projects. Integrating transcribed content into your organic strategy can drive significant improvements in search visibility and user engagement.

Enhance Website Content

Transcribe podcasts, interviews, or webinars to create valuable blog posts and articles. Search engines favor rich content that is frequently updated and relevant to user queries. Ensure the transcribed text reads naturally and follows best practices for on-page optimisation of web content.
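A raw transcript usually arrives as one long block of text. As a first pass towards a readable draft, the sketch below groups sentences into paragraphs. The sentence splitting is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace), and the function name and the default of four sentences per paragraph are assumptions.

```python
import re


def to_paragraphs(transcript, sentences_per_paragraph=4):
    """Group sentences into paragraphs separated by blank lines."""
    # Naive split: break after ., ! or ? when followed by whitespace
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()
    ]
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(paragraphs)
```

This does not replace editing: abbreviations such as "e.g." will trigger false splits, and paragraph breaks by count rarely match topic changes, so treat the result as a draft to revise.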

Improve Accessibility

Provide transcriptions for audio and video content to meet accessibility standards and expand your audience. Captioned videos and transcribed audio help individuals with hearing impairments and cater to users who prefer reading over listening. This inclusivity boosts site engagement, cross-platform brand visibility, and organic channel performance.

Convert video library to blog posts

Many brands’ web content and video content teams are separate, which hinders a consistent brand presence across platforms. Much of the long-form content that performs well on YouTube, for instance, can be useful in web content format, too. Speech-to-text libraries and APIs can help to convert your video library to text.

Competitor analysis and research

Similarly, speech-to-text models allow you to do competitor research in formats that are not text-based, like video or podcasts. This can help you better understand the competitive landscape and tailor your strategy better for different platforms.

Speech-to-text content transformation with the Whisper API can significantly boost organic brand performance for organisations that have a multi-platform presence.

Key Takeaways

Using OpenAI’s Whisper API, you can easily convert spoken words into text with high accuracy, supporting multiple languages and complex accents.
The article provides a detailed walkthrough for transcribing audio in Google Colab using Python, making it accessible even for beginners.
Leveraging transcriptions for website content, accessibility, repurposing video libraries, and competitor research can significantly improve SEO performance.
Whisper API offers high accuracy, multilingual support, and ease of integration, but it comes with privacy considerations, a file size limit, and usage costs.

Using OpenAI’s Whisper API for speech-to-text content transformation offers a powerful tool for improving content accessibility and brand omnipresence. This not only improves search visibility but also offers potential for engaging a broader audience.
