Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

OpenAI Whisper Intensive Reading [Paper Intensive Reading 45], bilibili video by Li Mu: https://www.bilibili.com/video/BV1VG4y1t74x/ More papers: https://github.com/mli/paper-reading

Pre-training of speech models in the self-supervised style: this paper is quite interesting; it is essentially the BERT idea from NLP applied to the field of speech recognition.

        Whisper crawls roughly 680,000 hours of labeled speech data from the Internet and directly trains a Transformer model on it. For unlabeled speech data, previous work used encoders pre-trained with contrastive learning. These pre-trained speech encoders learn fairly high-quality feature representations, but there is no equally good decoder: to actually use them, one still has to find labeled data and fine-tune, which essentially means training a decoder. Whisper argues that this fine-tuning step is complicated.

        The idea parallels BERT. BERT is bidirectional, uses the encoder module of the Transformer, and is essentially a pre-trained large language model; its pre-training tasks are cloze (masked language modeling) and next-sentence prediction. GPT is different: it is generative and uses the decoder module of the Transformer, so it is itself a decoder and does not need a separately fine-tuned decoder afterwards the way BERT does. So why not use the GPT mode for speech recognition? Because the speech signal is a sound wave: fed into a GPT it could only predict the next second of audio, and predicting the waveform is not the same as predicting words. A conversion from the speech signal into a text signal is needed in the middle, and that step still requires labeled data. In other words, even with unsupervised pre-training, a fine-tuning stage is still required afterwards. Whisper, of course, does it in one step, and since fine-tuning on specific data is never robust enough, it is best to use the model zero-shot.
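As a concrete illustration of the zero-shot usage, here is a minimal sketch with the released openai-whisper package; the checkpoint size and the audio file name are placeholders, not choices made in the paper.

```python
# pip install openai-whisper
import whisper

# Load a pre-trained checkpoint; "base" is just an example size.
model = whisper.load_model("base")

# Transcribe directly, zero-shot: no fine-tuning on the target data is involved.
result = model.transcribe("audio.mp3")
print(result["text"])
```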

        The authors built a weakly supervised dataset: it is supervised, but the data quality is relatively poor. It contains 680,000 hours, and a large Transformer is trained on it; once the model is large enough, multilingual and multitask training becomes beneficial. This approach needs no self-supervision. Previously, self-supervised pre-training often used more than 1,000,000 hours of data, followed by fine-tuning on roughly 40,000 hours of supervised data. Whisper instead directly expands the roughly 40,000 hours of labeled data into 680,000 hours of weakly supervised data, and the results are very good. This is similar to SAM; the image field can do the same thing.

        Whisper relies entirely on a sequence-to-sequence approach to predict the raw text. However, the data crawled from the Internet still needs some preprocessing. First, any speech-text pair whose transcript was generated by an ASR machine should be removed. Then all the data is cut into 30-second segments to be used as training data.
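As a rough illustration of the 30-second chunking only (not Whisper's actual data pipeline, which also aligns each segment with its transcript and filters out machine-generated text), a sketch assuming mono 16 kHz waveforms:

```python
import numpy as np

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per 30 s segment

def split_into_chunks(audio: np.ndarray) -> list:
    """Cut a mono 16 kHz waveform into fixed 30 s segments, zero-padding the last one."""
    chunks = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks
```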

        Whisper uses a Transformer with an encoder and a decoder. For the input, the audio is resampled to 16,000 Hz and then turned into an 80-channel log-scale Mel spectrogram. 16,000 Hz means 16,000 sample points per second, each with one value. A Fourier transform converts the time series into a frequency spectrum, and taking the log scale converts the spectrum magnitudes to a dB-like scale. Mel is used because human hearing responds differently to different frequencies: the response to low frequencies is a bit better and the response to high frequencies a bit worse, so a Mel spectrogram is still a spectrogram, but with lower resolution at high frequencies and higher resolution at low frequencies. In other words, a time-series signal becomes a 2D time-frequency map: an 80-dimensional feature is extracted at each time step, the window slides forward 10 ms at a time, and the audio is cut into 30-second segments. A 30-second speech signal therefore ends up as 3000 time steps, each an 80-dimensional feature vector.
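A minimal sketch of this feature extraction using the helpers shipped with the openai-whisper package (the audio file name is a placeholder):

```python
# pip install openai-whisper
import whisper

audio = whisper.load_audio("audio.mp3")   # decode and resample to 16 kHz mono
audio = whisper.pad_or_trim(audio)        # pad/trim to exactly 30 s (480,000 samples)
mel = whisper.log_mel_spectrogram(audio)  # 80-channel log-Mel spectrogram
print(mel.shape)                          # torch.Size([80, 3000]): 3000 frames of 80 dims
```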

Network model structure:

Model parameters:
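The exact widths and layer counts are given in the paper's model table; as a sketch, they can also be inspected from the openai-whisper package (the "base" size below is just an example):

```python
import whisper

# Load one of the released checkpoints; "base" is just an example size.
model = whisper.load_model("base")

# ModelDimensions records the architecture hyperparameters: number of Mel
# channels, audio/text context lengths, model widths, attention heads, and
# encoder/decoder layer counts.
print(model.dims)
```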

 Multilingual results:

Performance on Chinese is only average. The left figure shows the word error rate, with the amount of training data on the horizontal axis: there is a lot of Chinese (zh) data, yet the error rate is still quite high. The right figure shows translation performance, where everything is translated into English.


Origin blog.csdn.net/u012193416/article/details/130180826