Building a YouTube Transcript Extractor with Python

Have you ever wanted to extract the transcript from a YouTube video for note-taking, analysis, or content creation? In this tutorial, I’ll walk you through building a simple yet powerful Python script that does exactly that.

What We’re Building

Our script will:

Accept any YouTube URL as input
Extract the video transcript automatically
Display the transcript in the console
Optionally save it to a text file

Prerequisites

Before we start, make sure you have:

Python 3.6 or higher installed
Basic understanding of Python syntax
pip (Python package manager)

Installation

First, we need to install the required library:

pip install youtube-transcript-api

This library handles all the heavy lifting of communicating with YouTube’s API to fetch transcripts.

The Complete Code

Here’s the full script we’ll be building:

"""
YouTube Transcript Extractor
Extracts and saves transcripts from YouTube videos in plain text format.
"""

import re
import sys

# Import the YouTube Transcript API
try:
    from youtube_transcript_api import YouTubeTranscriptApi
    print("✓ YouTubeTranscriptApi imported successfully")
except ImportError as e:
    print(f"ERROR: Cannot import YouTubeTranscriptApi: {e}")
    print("\nPlease install: pip install youtube-transcript-api")
    sys.exit(1)

try:
    from youtube_transcript_api._errors import TranscriptsDisabled, NoTranscriptFound
    print("✓ Error classes imported successfully")
except ImportError:
    print("⚠ Warning: Could not import error classes (older version?)")
    class TranscriptsDisabled(Exception):
        pass
    class NoTranscriptFound(Exception):
        pass


def extract_video_id(url):
    """Extract video ID from various YouTube URL formats."""
    patterns = [
        r'(?:youtube\.com\/watch\?v=|youtu\.be\/)([^&\n?#]+)',
        r'youtube\.com\/embed\/([^&\n?#]+)',
        r'youtube\.com\/v\/([^&\n?#]+)'
    ]
    
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    
    return None


def get_transcript(video_url):
    """Fetch transcript for a YouTube video."""
    video_id = extract_video_id(video_url)
    
    if not video_id:
        print("Error: Invalid YouTube URL")
        return None
    
    print(f"✓ Extracted video ID: {video_id}")
    
    try:
        print("→ Fetching transcript...")
        
        # Create an instance and fetch transcript
        api = YouTubeTranscriptApi()
        transcript_list = api.fetch(video_id)
        
        print(f"✓ Found {len(transcript_list)} transcript segments")
        
        # Combine all text segments
        transcript_text = ' '.join([entry.text for entry in transcript_list])
        
        return transcript_text
    
    except TranscriptsDisabled:
        print(f"✗ Error: Transcripts are disabled for this video (ID: {video_id})")
        return None
    
    except NoTranscriptFound:
        print(f"✗ Error: No transcript found for this video (ID: {video_id})")
        print("   This video may not have captions available.")
        return None
    
    except AttributeError as e:
        print(f"✗ Error: API method not found - {str(e)}")
        print(f"   Available methods: {dir(YouTubeTranscriptApi)}")
        return None
    
    except Exception as e:
        print(f"✗ Error: {type(e).__name__}: {str(e)}")
        return None


def save_transcript(transcript, filename="transcript.txt"):
    """Save transcript to a text file."""
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(transcript)
        print(f"\n✓ Transcript saved to: {filename}")
    except Exception as e:
        print(f"✗ Error saving file: {str(e)}")


def main():
    print("\n" + "="*50)
    print("YouTube Transcript Extractor")
    print("="*50 + "\n")
    
    # Get URL from user
    if len(sys.argv) > 1:
        video_url = sys.argv[1]
    else:
        video_url = input("Enter YouTube URL: ").strip()
    
    if not video_url:
        print("✗ Error: No URL provided")
        return
    
    print(f"\nProcessing: {video_url}\n")
    
    # Get transcript
    transcript = get_transcript(video_url)
    
    if transcript:
        print("\n" + "="*50)
        print("TRANSCRIPT")
        print("="*50 + "\n")
        print(transcript)
        print("\n" + "="*50)
        
        # Ask if user wants to save
        save_choice = input("\nSave transcript to file? (y/n): ").strip().lower()
        if save_choice == 'y':
            filename = input("Enter filename (default: transcript.txt): ").strip()
            if not filename:
                filename = "transcript.txt"
            if not filename.endswith('.txt'):
                filename += '.txt'
            save_transcript(transcript, filename)
    else:
        print("\n✗ Failed to retrieve transcript")


if __name__ == "__main__":
    main()

Code Breakdown

Let’s break down each component to understand how it works.

1. Importing Required Libraries

import re
import sys
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api._errors import TranscriptsDisabled, NoTranscriptFound

re: Regular expressions module for pattern matching (used to extract video IDs)
sys: System-specific parameters (used for command-line arguments and exit codes)
YouTubeTranscriptApi: The main class for fetching transcripts
Error classes: Specific exceptions for handling transcript-related errors

2. Extracting the Video ID

def extract_video_id(url):
    """Extract video ID from various YouTube URL formats."""
    patterns = [
        r'(?:youtube\.com\/watch\?v=|youtu\.be\/)([^&\n?#]+)',
        r'youtube\.com\/embed\/([^&\n?#]+)',
        r'youtube\.com\/v\/([^&\n?#]+)'
    ]
    
    for pattern in patterns:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    
    return None

This function handles different YouTube URL formats:

Standard: https://www.youtube.com/watch?v=VIDEO_ID
Short: https://youtu.be/VIDEO_ID
Embed: https://www.youtube.com/embed/VIDEO_ID

The regex patterns extract the 11-character video ID from these URLs.

3. Fetching the Transcript

def get_transcript(video_url):
    video_id = extract_video_id(video_url)
    
    if not video_id:
        print("Error: Invalid YouTube URL")
        return None
    
    try:
        api = YouTubeTranscriptApi()
        transcript_list = api.fetch(video_id)
        transcript_text = ' '.join([entry.text for entry in transcript_list])
        return transcript_text
    except TranscriptsDisabled:
        print("Error: Transcripts are disabled for this video")
        return None
    except NoTranscriptFound:
        print("Error: No transcript found for this video")
        return None

Key steps:

Extract the video ID from the URL
Create an instance of the API
Fetch the transcript segments
Join all text segments into a single string
Handle common errors gracefully

The transcript_list contains multiple segments (each with text and timestamp). We use a list comprehension to extract just the text: [entry.text for entry in transcript_list]

4. Saving to a File

def save_transcript(transcript, filename="transcript.txt"):
    """Save transcript to a text file."""
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(transcript)
        print(f"\n✓ Transcript saved to: {filename}")
    except Exception as e:
        print(f"✗ Error saving file: {str(e)}")

This function uses Python’s context manager (with statement) to safely write the transcript to a file. The encoding='utf-8' ensures proper handling of special characters.

5. The Main Function

def main():
    # Get URL from command line or user input
    if len(sys.argv) > 1:
        video_url = sys.argv[1]
    else:
        video_url = input("Enter YouTube URL: ").strip()
    
    # Fetch and display transcript
    transcript = get_transcript(video_url)
    
    if transcript:
        print(transcript)
        
        # Offer to save
        save_choice = input("\nSave transcript to file? (y/n): ").strip().lower()
        if save_choice == 'y':
            filename = input("Enter filename (default: transcript.txt): ").strip()
            if not filename:
                filename = "transcript.txt"
            save_transcript(transcript, filename)

The main function orchestrates everything:

Gets the YouTube URL (from command line or user input)
Fetches the transcript
Displays it in the console
Offers to save it to a file

Usage Examples

Interactive Mode

Simply run the script and enter the URL when prompted:

python transcript_generator.py

Command-Line Mode

Pass the URL as an argument:

python transcript_generator.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Example Output

==================================================
YouTube Transcript Extractor
==================================================

✓ Extracted video ID: dQw4w9WgXcQ
→ Fetching transcript...
✓ Found 1254 transcript segments

==================================================
TRANSCRIPT
==================================================

Never gonna give you up, never gonna let you down...

==================================================

Save transcript to file? (y/n): y
Enter filename (default: transcript.txt): my_transcript.txt

✓ Transcript saved to: my_transcript.txt

Limitations and Considerations

Important notes:

This only works for videos with transcripts enabled (auto-generated or manual captions)
Some videos have transcripts disabled by their creators
Private or age-restricted videos may not be accessible
The transcript doesn’t include timestamps (but you can modify the code to include them)

Possible Enhancements

Want to take this project further? Here are some ideas:

Add timestamp support: Include timestamps alongside the text
Multiple languages: Fetch transcripts in different languages
Batch processing: Process multiple videos at once
GUI interface: Create a simple graphical interface using tkinter
Format options: Export to PDF, DOCX, or JSON formats
Search functionality: Search for specific words or phrases in the transcript

Conclusion

You’ve now built a functional YouTube transcript extractor! This script demonstrates several important programming concepts:

API interaction
Regular expressions
Error handling
File I/O operations
Command-line argument processing

Feel free to modify and extend this script for your own needs. Happy coding!

Source code: The complete code is available as a single Python file that you can copy and use immediately.

Questions or improvements? Leave a comment below or reach out. I’d love to hear how you’re using this script!

What We’re Building#

Prerequisites#

Installation#

The Complete Code#

Code Breakdown#

1. Importing Required Libraries#

2. Extracting the Video ID#

3. Fetching the Transcript#

4. Saving to a File#

5. The Main Function#

Usage Examples#

Interactive Mode#

Command-Line Mode#

Example Output#

Limitations and Considerations#

Possible Enhancements#

Conclusion#