Using AI to translate movie subtitles

This article shows how to use Python to drive ffmpeg and Gemini to translate movie subtitles. Sample output can be found in the "Results" section at the end.

Background

I left my last company not long ago and went independent again. My previous job changes were almost seamless, without much reflection. This time I decided to slow down, think things over, and work on something I find fun. I bought a NAS and found that the IT skills from my job were finally useful in my daily life; the first application was Chinese subtitles for movies.

The first thing to do with a new NAS is, of course, to start frantically downloading 4K movies. These movies all come with subtitles, but some have no Chinese subtitles or are poorly translated. In addition, the NAS software I bought is limited and downloading Chinese subtitles through it is troublesome, so I wanted an automated solution. After some evaluation, I figured that today's AI models such as ChatGPT and Gemini could translate the English subtitles with good results.

Using Poetry to manage the project

I haven't done many Python projects in the past few years, but I had seen a few projects using Poetry, so I decided to use it for this one. The experience has been very good, far better than pipenv, which I had used before.

The contents of my pyproject.toml file are as follows:

[tool.poetry]
name = "upbox"
version = "0.1.0"
description = ""
authors = ["rocksun <[email protected]>"]
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
ffmpeg-python = "^0.2.0"
llama-index = "^0.10.25"
llama-index-llms-gemini = "^0.1.6"
pysubs2 = "^1.6.1"
# yt-dlp = "^2024.4.9"
# typer = "^0.12.3"
# faster-whisper = "^1.0.1"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

I won't go into the details of Poetry here; you can learn it on your own. The dependencies are: ffmpeg-python, a wrapper around ffmpeg (the ffmpeg command must be on the PATH); llama-index and its Gemini integration, although this article uses so few llama-index features that it makes little difference whether you use it or call Gemini directly; and finally the subtitle-processing library pysubs2. I considered parsing the subtitles myself, but pysubs2 ended up saving a lot of time.

English subtitle extraction

It is easy to extract subtitles embedded in a video with ffmpeg; just run the following command:

ffmpeg -i my_file.mkv outfile.vtt

In practice, however, a video may contain multiple subtitle streams, so blindly taking the default one is not reliable and you need to check which stream to extract. For this I use an ffmpeg wrapper library, ffmpeg-python. The code for extracting the English subtitles with it is as follows:

import ffmpeg  # ffmpeg-python wrapper; the ffmpeg/ffprobe binaries must be on the PATH


def _guess_eng_subtitle_index(video_path):
    # probe the container and look for a subtitle stream tagged as English
    probe = ffmpeg.probe(video_path)
    streams = probe['streams']
    for index, stream in enumerate(streams):
        if stream.get('codec_type') == 'subtitle' and stream.get('tags', {}).get('language') == 'eng':
            return index
    # otherwise fall back to a subtitle stream whose title mentions "english"
    for index, stream in enumerate(streams):
        if stream['codec_type'] == 'subtitle' and stream.get('tags', {}).get('title', "").lower().find("english")!=-1 :
            return index
    return -1

def _extract_subtitle_by_index(video_path, output_path, index):
    return ffmpeg.input(video_path).output(output_path, map='0:'+str(index)).run()

def extract_subtitle(video_path, en_subtitle_path):
    # get the streams from video with ffprobe
    index = _guess_eng_subtitle_index(video_path)
    if index == -1:
        return -1
    
    return _extract_subtitle_by_index(video_path, en_subtitle_path, index)

A helper method, _guess_eng_subtitle_index, is used to determine the index of the English subtitle stream. Although most videos have reasonably standardized subtitle tags, some have no tags at all, so the method has to guess. In practice there are surely other cases, and those can only be handled as they come up.
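
A minimal usage sketch, assuming the helpers above are in scope (the file names are hypothetical):

# Hypothetical file names, for illustration only.
video = "movie.mkv"
en_subtitle = "movie.en.vtt"

# extract_subtitle returns -1 when no English subtitle stream could be found.
if extract_subtitle(video, en_subtitle) == -1:
    print(f"no English subtitle stream found in {video}")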

English subtitle processing

At first I thought it would be enough to simply hand the subtitles to Gemini and save the result, but that did not work. There were several problems:

  1. Many English subtitles contain markup tags, which hurt translation quality.
  2. A large subtitle file is too much for Gemini to handle in one request, and an overly long context causes problems.
  3. The timestamps in the subtitles are verbose and make the prompt unnecessarily long.

To deal with these problems, I added a subtitle class, UpSubs:

import re

import pysubs2
from pysubs2 import SSAFile, SSAEvent


class UpSubs:
    def __init__(self, subs_path):
        self.subs = pysubs2.load(subs_path)

    def get_subtitle_text(self):
        text = ""
        for sub in self.subs:
            text += sub.text + "\n\n"
        return text

    def get_subtitle_text_with_index(self):
        text = ""
        for i, sub in enumerate(self.subs):
            text += "chunk-"+str(i) + ":\n" + sub.text.replace("\\N", " ") + "\n\n"
        return text
    
    def save(self, output_path):
        self.subs.save(output_path)

    def clean(self):
        for sub in self.subs:
            # remove xml-like tags and line breaks from the subtitle text
            sub.text = re.sub(r"<[^>]+>", "", sub.text)
            sub.text = sub.text.replace("\\N", " ")

    def fill(self, text):
        # parse the translated text and write each "chunk-N" paragraph back
        # into the subtitle event with index N
        text = text.strip()
        pattern = r"\n\s*\n"
        paragraphs = re.split(pattern, text)
        for para in paragraphs:
            try:
                first_line = para.split("\n")[0]
                # strip the leading "chunk-" and the trailing ":" to get the index
                countstr = first_line[6:len(first_line) - 1]
                index = int(countstr)
                p = "\n".join(para.split("\n")[1:])
                self.subs[index].text = p
            except Exception as e:
                print(f"Error merging paragraph:\n{para}\nwith exception:\n{e}")
                raise
    
    def merge_dual(self, subspath):
        second_subs = pysubs2.load(subspath)
        merged_subs = SSAFile()
        if len(self.subs.events) == len(second_subs.events):            
            for i, first_event in enumerate(self.subs.events):
                second_event = second_subs[i]
                if first_event.text == second_event.text:
                    merged_event = SSAEvent(first_event.start, first_event.end, first_event.text)
                else:
                    merged_event = SSAEvent(first_event.start, first_event.end, first_event.text + '\n' + second_event.text)
                merged_subs.append(merged_event)
            return merged_subs
        
        return None

The clean method does simple cleanup of the subtitle text; save writes the subtitles to disk; merge_dual merges two tracks into bilingual subtitles. These are all straightforward; below we focus on how the subtitle text is processed.
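
For example, once the Chinese subtitles exist, the two tracks could be combined into one bilingual file roughly like this (a sketch only; the file names are hypothetical):

# Hypothetical file names, for illustration only.
en_subs = UpSubs("movie.en.vtt")
merged = en_subs.merge_dual("movie.zh.vtt")  # returns None if the event counts differ
if merged is not None:
    merged.save("movie.en-zh.srt")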

The original srt file format is as follows:

12
00:02:30,776 --> 00:02:34,780
Not even the great Dragon Warrior.

13
00:02:43,830 --> 00:02:45,749
Oh, where is Po?

14
00:02:45,749 --> 00:02:48,502
He was supposed to be here hours ago.

After get_subtitle_text_with_index, this becomes:

chunk-12:
Not even the great Dragon Warrior.

chunk-13:
Oh, where is Po?

chunk-14:
He was supposed to be here hours ago.

This keeps the prompt short while still letting each subtitle line be tracked by its index. The fill method then restores the subtitles from the translated text.
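
A sketch of the whole round trip, where translate stands in for the Gemini call described in the next section:

# Round-trip sketch; `translate` is a placeholder for the Gemini call shown later.
subs = UpSubs("movie.en.vtt")
subs.clean()                                 # strip tags and line breaks
text = subs.get_subtitle_text_with_index()   # "chunk-0:\n...\n\nchunk-1:\n..."
translated = translate(text)                 # must preserve the chunk-N: headers
subs.fill(translated)                        # write translations back by index
subs.save("movie.zh.vtt")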

Calling Gemini

There are several problems to deal with when calling Gemini:

  • An API key is required.
  • Access from mainland China requires a suitable proxy.
  • The calls need some fault tolerance.
  • Gemini's safety filters need to be relaxed.

To address these, I wrote a dedicated complete function:

import traceback

from llama_index.llms.gemini import Gemini


def complete(prompt, max_tokens=32760):
    prompt = prompt.strip()
    if not prompt:
        return ""
    
    safety_settings = [
        {
            "category": "HARM_CATEGORY_HARASSMENT",
            "threshold": "BLOCK_NONE"
        },
        {
            "category": "HARM_CATEGORY_HATE_SPEECH",
            "threshold": "BLOCK_NONE"
        },
        {
            "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
            "threshold": "BLOCK_NONE"
        },
        {
            "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
            "threshold": "BLOCK_NONE"
        },
    ]

    retries = 3
    for _ in range(retries):
        try:
            return Gemini(max_tokens=max_tokens, safety_settings=safety_settings, temperature = 0.01).complete(prompt).text
        except Exception as e:
            print(f"Error completing prompt: {prompt} \n with error: \n ")
            traceback.print_exc()
    return ""

The safety_settings are very important: movie subtitles often contain quite sensitive language, and Gemini has to be told to tolerate it as much as possible. According to the documentation, BLOCK_NONE is only available to paid accounts, yet with the configuration above I ran into few problems when translating movies; the occasional failure usually went away on a retry.

On top of that, up to three retries are made; the call fails occasionally, and retrying resolves most transient errors.

Finally, the API key can be obtained from Google AI Studio. Then add a .env file to the project:

http_proxy=http://192.168.0.107:7890
https_proxy=http://192.168.0.107:7890
GOOGLE_API_KEY=[your-api-key]

From this file the program reads the API key and the proxy settings.
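
The article does not show how the .env file is loaded; one common approach (an assumption here, since python-dotenv is not listed in the pyproject above) is:

# Sketch: load .env so that GOOGLE_API_KEY and the proxy variables end up in
# os.environ, where the HTTP stack and the Gemini client can pick them up.
# python-dotenv is an assumption, not part of the author's dependency list.
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory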

The translation flow

Let's look at the outermost tran_subtitles function first:

import os


def tran_subtitles(fixed_subtitle, zh_subtitle=None, cncf=False, chunk_size=3000):
    subtitle_base = os.path.splitext(fixed_subtitle)[0]
    video_base = os.path.splitext(subtitle_base)[0]
    if zh_subtitle is None:
        zh_subtitle = video_base + ".zh-fixed.vtt"
    if os.path.exists(zh_subtitle):
        print(f"zh subtitle {zh_subtitle} already translated, skip to translate.")
        return 1

    prompt_tpl = MOVIE_TRAN_PROMPT_TPL
    opts = { }

    srtp = UpSubs(fixed_subtitle)
    text = srtp.get_subtitle_text_with_index()

    process_text(srtp, text, prompt_tpl, opts, chunk_size = chunk_size)
    srtp.save(zh_subtitle)

    return zh_subtitle

The logic is straightforward: read the English subtitles, convert them into the text to be translated with get_subtitle_text_with_index, and then run process_text to do the translation. The prompt template prompt_tpl directly references MOVIE_TRAN_PROMPT_TPL, which contains:

MOVIE_TRAN_PROMPT_TPL = """你是个专业电影字幕翻译,你需要将一份英文字幕翻译成中文。
[需要翻译的英文字幕]:

{content}

# [中文字幕]:"""

As you can see, the prompt is quite simple. It roughly says: "You are a professional movie subtitle translator; you need to translate an English subtitle file into Chinese", followed by the English content and a header for the Chinese result.

Next, look at the process_text function:

def process_text(subs, text, prompt_tpl, opts, chunk_size=2500):
    chunks = _split_subtitles(text, chunk_size)
    for i, chunk in enumerate(chunks):
        print("process chunk ({}/{})".format(i + 1, len(chunks)))
        # fill the prompt template with the current chunk
        opts["content"] = chunk
        prompt = prompt_tpl.format(**opts)
        print(prompt)

        out = complete(prompt, max_tokens=32760)
        subs.fill(out)
        print(out)
The subtitle text is split into multiple chunks with _split_subtitles, and each chunk is passed to the complete function described above; the translated output for each chunk is then written back into the subtitles with fill.
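
_split_subtitles itself is not shown in the article; here is a minimal sketch of one possible implementation, which splits on blank lines and groups paragraphs so that each chunk stays within chunk_size characters (an assumption about how the author's helper behaves):

import re

def _split_subtitles(text, chunk_size):
    # Split on blank lines (one paragraph per chunk-N entry) and group the
    # paragraphs greedily so each chunk stays roughly within chunk_size characters.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks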

Results

At first I did not have high expectations for the translated subtitles, but the final results were surprisingly good. Taking Kung Fu Panda 4 as an example, here is a comparison of a few lines:

English subtitles:

10
00:02:22,184 --> 00:02:27,606
Let it be known from the highest mountain
to the lowest valley that Tai Lung lives,

11
00:02:27,606 --> 00:02:30,776
and no one will stand in his way.

12
00:02:30,776 --> 00:02:34,780
Not even the great Dragon Warrior.

13
00:02:43,830 --> 00:02:45,749
Oh, where is Po?

Chinese subtitles:

10
00:02:22,184 --> 00:02:27,606
让最高的山峰和最低的山谷都知道,泰隆还活着,

11
00:02:27,606 --> 00:02:30,776
没人能阻挡他。

12
00:02:30,776 --> 00:02:34,780
即使是伟大的神龙大侠也不行。

13
00:02:43,830 --> 00:02:45,749
哦,阿宝在哪儿?

The results exceeded my expectations. My prompt did not provide any additional context, yet Gemini produced an idiomatic translation, for example rendering "the great Dragon Warrior" as 伟大的神龙大侠.

Summary

For movies, the code above runs fairly reliably. But when the source subtitles are themselves of poor quality, the translated result is not great either and many anomalies appear, so there is still a lot to improve. Recently my video channel (Yunyunzhongshengs) has shared some technical videos produced with an improved version of this code, which I will share later.

It feels great to be able to use technology to improve everyday life, and I hope there will be more opportunities like this. Everyone is welcome to follow along and exchange ideas.

This article was first published on Yunyunzhongsheng (https://yylives.cc/); everyone is welcome to visit.
