SenseVoice Deployment on Windows, Plus a Simple API Extension: Speech Recognition from a Microphone

What is SenseVoice

Quoted from the README (FunAudioLLM/SenseVoice, README_zh.md): https://github.com/FunAudioLLM/SenseVoice/blob/main/README_zh.md

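SenseVoice is an audio foundation model with audio understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and acoustic event classification (AEC) or acoustic event detection (AED). The project provides an introduction to the SenseVoice model, benchmarks on several task test sets, and the environment setup and inference methods needed to try the model.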

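Compared with the currently mainstream FastWhisper model at the small size, SenseVoice additionally outputs emotion and event labels, and its emotion recognition accuracy is higher than that of some open-source speech emotion classifiers (although I still find it a little lacking).
In addition, Se (SenseVoice, abbreviated from here on; likewise Fa for FastWhisper) recognizes much faster than Fa: short texts (under 20 characters) finish within roughly a hundred milliseconds.

The downside is that the large version of Se is not open source, whereas all three versions of Fa are. The common view at present is that Fa's large and medium models perform about the same in production, with processing time measured in seconds. Judging from the official tables, Se large and Fa perform almost identically.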

Deployment

This guide assumes you are using an Anaconda or Miniconda virtual environment.

# Clone the repository
git clone https://github.com/FunAudioLLM/SenseVoice.git
cd SenseVoice

# Create the virtual environment
conda create -n sensevoice python=3.10 
conda activate sensevoice

# Install the dependencies inside the virtual environment
pip install -r requirements.txt

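# Note: CUDA support in the base environment does not carry over. If you want GPU inference, install a matching CUDA build of torch inside the virtual environment: edit requirements.txt, replace the torch entries with the lines below, then run pip install again.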

--extra-index-url https://download.pytorch.org/whl/cu118
torch==2.0.1+cu118
torchaudio==2.0.2

python webui.py

This starts the web UI, where you can run the most basic tests.
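
If you edited requirements.txt for CUDA, it can be worth confirming which torch build the environment actually picked up. The helper below is only a sketch and is not part of SenseVoice; `describe_torch_backend` is a name made up for this post.

```python
import importlib.util


def describe_torch_backend():
    """Describe the torch build visible in the current environment."""
    # Avoid an ImportError if torch has not been installed yet
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version the wheel was built against
        return f"torch {torch.__version__} with CUDA {torch.version.cuda}"
    return f"torch {torch.__version__} (CPU only)"


print(describe_torch_backend())
```

Run it inside the activated sensevoice environment. A "CPU only" result after installing the cu118 lines suggests pip resolved a different wheel than intended.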

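Some notes

Note that on the first online run the model is downloaded to the user cache directory, C:\Users\Admin\.cache\modelscope\hub\iic (Admin is the Windows user name here).

Copy the iic folder into the project root and the project can then run offline.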
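
The copy step can also be scripted. This is a minimal sketch using only the standard library; `copy_model_cache` is a hypothetical helper, and the cache location in the comment is the default path mentioned above, which may differ on your machine.

```python
import shutil
from pathlib import Path


def copy_model_cache(cache_dir, project_root):
    """Copy the downloaded model folder (e.g. `iic`) into the project root."""
    src = Path(cache_dir)
    dst = Path(project_root) / src.name
    if not src.is_dir():
        raise FileNotFoundError(f"model cache not found: {src}")
    # dirs_exist_ok allows re-running after a partial copy (Python 3.8+)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example call, run from the SenseVoice project root (path is an assumption):
# copy_model_cache(Path.home() / ".cache" / "modelscope" / "hub" / "iic", Path.cwd())
```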

The small Se model is very fast. The web UI shows emotions and events as emoji, while the API returns them as preset tags directly in the JSON.
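
To illustrate the API-side tags: the client code later in this post extracts a tag from the `raw_text` field by matching angle brackets. Assuming the tags appear as a prefix such as `<|zh|><|HAPPY|><|BGM|>` in front of the transcript (this is how the output looked in testing, not a documented contract), a sketch that separates tags from text could look like this. `split_tags` is a made-up name.

```python
import re

# Matches one tag of the assumed form <|NAME|>
TAG_PATTERN = re.compile(r"<\|([^|<>]*)\|>")


def split_tags(raw_text):
    """Return (tags, text): tag names in order, and the text with tags removed."""
    tags = TAG_PATTERN.findall(raw_text)
    text = TAG_PATTERN.sub("", raw_text).strip()
    return tags, text


tags, text = split_tags("<|zh|><|HAPPY|><|BGM|>hello")
print(tags, text)
```

Collecting all tags instead of only the second one keeps the language and event labels available as well.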

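That said, it still has weaknesses with nasal sounds (for example, telling 天 from 年) and with some vocabulary.

For example, when tested with an audio clip of Kunkun, it correctly detected the background-music event but got the phrase 偶像练习生 (Idol Producer) wrong. I do not know whether the large model would do better.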

Extending the API

Record from the microphone, save the audio to a temporary file, then call the local API to recognize it.

# Install the two audio libraries
pip install SpeechRecognition
pip install PyAudio

# Start the SenseVoice API
python api.py

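# Before running it, remember to append the block below to the end of api.py so that the API service actually starts.
# Binding to host 0.0.0.0 is recommended because the service is then reachable from other machines on the LAN; with 127.0.0.1 you can only test on the local machine. 0.0.0.0 suits developers with several test machines. Port 8666 is used because my port 8000 is already running ollama.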

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8666)
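
When testing from another machine on the LAN, it helps to check first that the port is reachable at all. Here is a sketch using only the standard library; `wait_for_port` is a made-up helper, and 8666 is simply the port chosen above.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed immediately; only reachability matters
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example: wait_for_port("127.0.0.1", 8666)
```

A False result from a remote machine but True locally usually points to the host binding or a firewall rule rather than to SenseVoice itself.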

import io
import time
import wave
import requests
import speech_recognition as sr
from tqdm import tqdm
import re

class AudioRecorder:
    def __init__(self, rate=16000):
        """Initialize the recorder and set the sample rate."""
        self.rate = rate
        self.recognizer = sr.Recognizer()

    def record(self):
        """Record from the microphone and return the audio as in-memory WAV data."""
        audio = None
        with sr.Microphone(sample_rate=self.rate) as source:
            print('Please start speaking before the countdown ends', flush=True)
            time.sleep(0.1)  # give the print a moment to appear before tqdm draws

            for _ in tqdm(range(20), desc="Countdown", unit="s"):
                try:
                    # Wait at most 1 s for speech to start so the progress bar keeps updating
                    audio = self.recognizer.listen(source, timeout=1, phrase_time_limit=15)
                    break  # recording succeeded, leave the loop
                except sr.WaitTimeoutError:
                    continue  # no speech yet; the loop gives up after 20 attempts

        if audio is None:
            print("No speech input detected")
            return None

        # get_wav_data() returns a complete WAV file, header included
        return io.BytesIO(audio.get_wav_data())

    def save_wav(self, audio_data, filename="temp_output.wav"):
        """Save the in-memory WAV data to a WAV file on disk."""
        # audio_data already holds a complete WAV file (header included), so parse
        # it first; writing the raw bytes as frames would embed the header as noise.
        audio_data.seek(0)
        with wave.open(audio_data, 'rb') as src:
            params = src.getparams()
            frames = src.readframes(src.getnframes())

        with wave.open(filename, 'wb') as wav_file:
            wav_file.setparams(params)
            wav_file.writeframes(frames)
        audio_data.seek(0)  # rewind so the caller can read from the start

    def run(self):
        """Record audio and save it to a file."""
        audio_data = self.record()
        if audio_data:
            self.save_wav(audio_data, "temp_output.wav")
        return audio_data

class SenseVoice:
    def __init__(self, api_url, emo=False):
        """Initialize the recognition client with the API URL and the emotion switch."""
        self.api_url = api_url
        self.emo = emo

    def _extract_second_bracket_content(self, raw_text):
        """Return the content of the second pair of angle brackets in the text."""
        match = re.search(r'<[^<>]*><([^<>]*)>', raw_text)
        if match:
            return match.group(1)
        return None

    def _get_speech_text(self, audio_data):
        """Send the audio data to the API and return the recognition result."""
        print('Running speech recognition')
        files = [('files', audio_data)]
        data = {'keys': 'audio1', 'lang': 'auto'}

        response = requests.post(self.api_url, files=files, data=data)
        if response.status_code != 200:
            return f"Request failed, status code: {response.status_code}"

        results = response.json().get("result") or []
        if not results:
            return "No valid text recognized"

        text = results[0]["text"]
        if self.emo:
            # The tag may be absent from raw_text; fall back to the plain text
            tag = self._extract_second_bracket_content(results[0]["raw_text"])
            return f"{tag}\n{text}" if tag is not None else text
        return text

    def speech_to_text(self, audio_data):
        """Call the API to run speech recognition and return the result."""
        return self._get_speech_text(audio_data)

# Usage example
if __name__ == "__main__":
    recorder = AudioRecorder()
    audio_data = recorder.run()

    if audio_data:
        api_url = "http://localhost:8666/api/v1/asr"
        sense_voice = SenseVoice(api_url, emo=True)
        result = sense_voice.speech_to_text(audio_data)
        print("Recognition result:", result)

Save the client code above (from the imports onward) as a .py file and run it while the API is running.

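Addendum

SenseVoice is now available through Alibaba Cloud's speech services. If you need server-side deployment or production use, you can go straight to the official API. Besides, the large SenseVoice model is not open source yet, so it will probably remain paid-only for a long time.

I will test it when I have time.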