Using a third-party Python library and Whisper, without an online speech recognition service, get the start time of a sentence within a recording

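One way to do this without any speech recognition service is to match audio signals directly, using the Python audio processing library librosa together with numpy.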

Here is a code example:

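import numpy as np
import librosa

# Load the recording to search (librosa resamples to 22050 Hz mono by default)
y, sr = librosa.load("path/to/audio/file.wav")

# The phrase to locate (used only in the printed message)
phrase = "hello world"

# Load a reference clip of the phrase at the same sample rate.
# librosa cannot synthesize speech from text, so this clip must be supplied;
# the path below is a placeholder.
phrase_signal, _ = librosa.load("path/to/phrase.wav", sr=sr)

# Cross-correlate the recording with the reference clip and take the best offset.
# librosa has no correlate function, so np.correlate is used here;
# scipy.signal.correlate(..., method="fft") is faster for long recordings.
corr = np.correlate(y, phrase_signal, mode="valid")
pos = int(np.argmax(corr))

# Convert the sample offset to seconds and print the start time of the phrase
start_time = librosa.samples_to_time(pos, sr=sr)
print(f"The phrase '{phrase}' starts at {start_time:.2f} seconds.")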

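The idea behind this code is to obtain an audio signal for the target phrase, search the recording for the position where that signal matches best, and convert that sample position into a time. Note that this compares waveforms directly and does not transcribe speech, so it only works when the phrase audio closely resembles how the phrase sounds in the recording.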
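
The matching step can be checked on synthetic data before trying it on real audio. The following is a minimal sketch, assuming a hypothetical helper named find_offset and a made-up sample rate of 1000 Hz; it normalizes the cross-correlation by the energy of each window so that loud passages do not win by volume alone.

```python
import numpy as np

def find_offset(recording, template):
    """Return the sample index where `template` best matches `recording`.

    Hypothetical helper for illustration: normalized cross-correlation.
    """
    recording = np.asarray(recording, dtype=np.float64)
    template = np.asarray(template, dtype=np.float64)
    # Raw cross-correlation at every offset where the template fits entirely
    corr = np.correlate(recording, template, mode="valid")
    # Energy of each recording window with the same length as the template
    window_energy = np.convolve(recording ** 2, np.ones(len(template)), mode="valid")
    norm = np.sqrt(window_energy * np.sum(template ** 2))
    # Guard against division by zero in silent windows
    score = corr / np.maximum(norm, 1e-12)
    return int(np.argmax(score))

# Synthetic check: hide a known clip in low-level noise at sample 700
rng = np.random.default_rng(0)
sr = 1000  # made-up sample rate for the demo
template = rng.standard_normal(200)
recording = 0.1 * rng.standard_normal(2000)
recording[700:900] += template

offset = find_offset(recording, template)
print(f"Best match at sample {offset}, i.e. {offset / sr:.3f} seconds")
```

On real speech the same phrase rarely produces the same waveform twice, so a clean peak like this should not be expected unless the reference clip was cut from the recording itself.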

Using a third-party Python library and Whisper, without an online speech recognition service, get the start time of a sentence within a recording
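
For the Whisper-based route the question asks about, here is a minimal sketch. It assumes the open-source openai-whisper package (pip install openai-whisper, with ffmpeg available), which runs locally once the model weights have been downloaded, and it relies on that package's word_timestamps option of transcribe. The function names transcribe_with_word_times and find_phrase_start are hypothetical helpers introduced for illustration, and the timestamps Whisper reports are estimates rather than exact boundaries.

```python
import re

def _normalize(text):
    """Lowercase and drop whitespace and punctuation so matching is forgiving."""
    return re.sub(r"[\W_]+", "", text.lower())

def transcribe_with_word_times(audio_path, model_name="base"):
    """Transcribe locally with openai-whisper and return its segment list.

    Assumes the openai-whisper package; imported lazily so the matching
    helper below can be used and tested without it.
    """
    import whisper
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    return result["segments"]

def find_phrase_start(segments, phrase):
    """Return the start time in seconds of the first match of `phrase`, or None.

    `segments` is a list of dicts, each with a "words" list whose items carry
    "word" and "start" keys, as produced with word_timestamps=True.
    """
    target = _normalize(phrase)
    if not target:
        return None
    text = ""
    starts = []  # starts[i] is the start time of the word holding character i
    for segment in segments:
        for word in segment.get("words", []):
            token = _normalize(word["word"])
            text += token
            starts.extend([word["start"]] * len(token))
    # Simplification: a plain substring search, which may also match mid-word;
    # in that case the start time of the containing word is returned
    index = text.find(target)
    return starts[index] if index != -1 else None
```

Typical use would be segments = transcribe_with_word_times("path/to/audio/file.wav") followed by find_phrase_start(segments, "hello world"). Because the search runs over concatenated characters rather than space-separated tokens, it should also work for languages written without spaces, at the cost of occasionally matching inside a longer word.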