Question:

Google Cloud Speech-to-Text does not transcribe streamed audio correctly on some iDevices

越望

2023-03-14

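Over the past few weeks I have implemented the Google Cloud Speech-to-Text API using real-time streamed audio. Everything looked fine at first, but I recently tested the product on more devices and found some strange irregularities on certain iDevices. First, here is the relevant code: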

Frontend (React component)

constructor(props) {
  super(props);
  this.audio = props.audio;
  this.socket = new SocketClient();
  this.bufferSize = 2048;
}

/**
* Initializes the user's microphone and the audio stream.
*
* @return {void}
*/
startAudioStream = async () => {
  const AudioContext = window.AudioContext || window.webkitAudioContext;
  this.audioCtx = new AudioContext();
  this.processor = this.audioCtx.createScriptProcessor(this.bufferSize, 1, 1);
  this.processor.connect(this.audioCtx.destination);
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  /* Debug through instant playback:
  this.audio.srcObject = stream;
  this.audio.play();
  return; */

  this.globalStream = stream;
  this.audioCtx.resume();
  this.input = this.audioCtx.createMediaStreamSource(stream);
  this.input.connect(this.processor);

  this.processor.onaudioprocess = (e) => {
    this.microphoneProcess(e);
  };
  this.setState({ streaming: true });
}

/**
 * Processes microphone input and passes it to the server via the open socket connection.
 *
 * @param {AudioProcessingEvent} e
 * @return {void}
 */
microphoneProcess = (e) => {
  const { speaking, askingForConfirmation, askingForErrorConfirmation } = this.state;
  const left = e.inputBuffer.getChannelData(0);
  const left16 = Helpers.downsampleBuffer(left, 44100, 16000);
  if (speaking === false) {
    this.socket.emit('stream', {
      audio: left16,
      context: askingForConfirmation || askingForErrorConfirmation ? 'zip_code_yes_no' : 'zip_code',
      speechContext: askingForConfirmation || askingForErrorConfirmation ? ['ja', 'nein', 'ne', 'nö', 'falsch', 'neu', 'korrektur', 'korrigieren', 'stopp', 'halt', 'neu'] : ['$OPERAND'],
    });
  }
}

Helpers (downsampleBuffer)

/**
 * Downsamples a given audio buffer from sampleRate to outSampleRate.
 * @param {Array} buffer The audio buffer to downsample.
 * @param {number} sampleRate The original sample rate.
 * @param {number} outSampleRate The new sample rate.
 * @return {Array} The downsampled audio buffer.
 */
static downsampleBuffer(buffer, sampleRate, outSampleRate) {
  if (outSampleRate === sampleRate) {
    return buffer;
  }
  if (outSampleRate > sampleRate) {
    throw new Error('Downsampling rate should be smaller than original sample rate');
  }
  const sampleRateRatio = sampleRate / outSampleRate;
  const newLength = Math.round(buffer.length / sampleRateRatio);
  const result = new Int16Array(newLength);
  let offsetResult = 0;
  let offsetBuffer = 0;
  while (offsetResult < result.length) {
    const nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);
    let accum = 0;
    let count = 0;
    for (let i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
      accum += buffer[i];
      count++;
    }

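    // Clamp to [-1, 1] on both sides so out-of-range samples cannot wrap around in the Int16Array
    result[offsetResult] = Math.max(-1, Math.min(1, accum / count)) * 0x7FFF;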
    offsetResult++;
    offsetBuffer = nextOffsetBuffer;
  }
  return result.buffer;
}

Backend (socket server)

io.on('connection', (socket) => {
  logger.debug('New client connected');
  const speechClient = new SpeechService(socket);

  socket.on('stream', (data) => {
    const audioData = data.audio;
    const context = data.context;
    const speechContext = data.speechContext;
    speechClient.transcribe(audioData, context, speechContext);
  });
});

Backend (speech client / transcribe function, where the data is sent to GCloud)

async transcribe(data, context, speechContext, isFile = false) {
  if (!this.recognizeStream) {
    logger.debug('Initiating new Google Cloud Speech client...');
    let waitingForMoreData = false;
    // Create new stream to the Google Speech client
    this.recognizeStream = this.speechClient
      .streamingRecognize({
        config: {
          encoding: 'LINEAR16',
          sampleRateHertz: 16000,
          languageCode: 'de-DE',
          speechContexts: speechContext ? [{ phrases: speechContext }] : undefined,
        },
        interimResults: false,
        singleUtterance: true,
      })
      .on('error', (error) => {
        if (error.code === 11) {
          this.recognizeStream.destroy();
          this.recognizeStream = null;
          return;
        }
        this.socket.emit('error');
        this.recognizeStream.destroy();
        this.recognizeStream = null;
        logger.error(`Received error from Google Cloud Speech client: ${error.message}`);
      })
      .on('data', async (gdata) => {
        if ((!gdata.results || !gdata.results[0]) && gdata.speechEventType === 'END_OF_SINGLE_UTTERANCE') {
          logger.debug('Received END_OF_SINGLE_UTTERANCE - waiting 300ms for more data before restarting stream');
          waitingForMoreData = true;
          setTimeout(() => {
            if (waitingForMoreData === true) {
              // User was silent for too long - restart stream
              this.recognizeStream.destroy();
              this.recognizeStream = null;
            }
          }, 300);
          return;
        }
        waitingForMoreData = false;
        const transcription = gdata.results[0].alternatives[0].transcript;
        logger.debug(`Transcription: ${transcription}`);

        // Emit transcription and MP3 file of answer
        this.socket.emit('transcription', transcription);
        const filename = await ttsClient.getAnswerFromTranscription(transcription, 'fairy', context); // TODO-Final: Dynamic character
        if (filename !== null) this.socket.emit('speech', `${config.publicScheme}://${config.publicHost}:${config.publicPort}/${filename}`);

        // Restart stream
        if (this.recognizeStream) this.recognizeStream.destroy();
        this.recognizeStream = null;
      });
  }
  // eslint-disable-next-line security/detect-non-literal-fs-filename
  if (isFile === true) fs.createReadStream(data).pipe(this.recognizeStream);
  else this.recognizeStream.write(data);
}

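Now, the behaviour varies a lot across the devices I tested. I originally developed on a 2017 iMac using Google Chrome as the browser. Works like a charm. Then I tested on an iPhone 11 Pro and an iPad Air 4, both in Safari and as a full-screen web app. Again, works like a charm.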

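Afterwards I tried an iPad Pro 12.9" (2017). Suddenly, Google Cloud sometimes returns no transcription at all, and at other times it returns something that, only with a lot of imagination, sounds like the text that was actually spoken. Same behaviour on an iPad 5 and an iPhone 6 Plus.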

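I don't really know where to go from here. What I have read so far, at least, is that with the iPhone 6s (unfortunately I couldn't find anything about iPads) the hardware sample rate changed from 44.1 kHz to 48 kHz. So I thought that might be it and played around with the sample rates everywhere in the code, without success. I also noticed that my iMac with Google Chrome runs at 44.1 kHz as well, just like the "old" iPads where transcription does not work. Likewise, the newer iPads run at 48 kHz, and everything works fine there. So that can't be it.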

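I also noticed this: when I connect some AirPods to the "bad" devices and use them as the audio input, everything works again. So it must have something to do with how these devices handle their internal microphone. I just don't know what exactly.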

Can anyone point me in the right direction? What changed between these device generations with regard to audio and microphones?

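Update 1: I have now implemented a quick function that uses node-wav to write the streamed PCM data from the frontend to a file on the backend. I think I'm getting closer: on the devices where speech recognition goes nuts, I sound like a chipmunk (extremely high-pitched). I also noticed that the binary audio data arrives much more slowly than on the devices where everything works. So it probably has to do with sample rate or bit rate, encoding, or something similar. Unfortunately I'm not an audio expert, so I don't know what to do next.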

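Update 2: After a lot of trial and error, I found that if I set the sample rate in the Google Cloud config to roughly 9500 to 10000, everything works. When I use that value as the sample rate for the node-wav file output, it sounds fine too. If I reset the "outgoing" sample rate to GCloud back to 16000 and downsample the audio input in the frontend from about 25000 instead of 44100 down to 16000 (see the microphoneProcess function in the frontend React component), it works as well. So there seems to be a factor of about 0.6 somewhere in the sample rate difference. However, I still don't know where this behaviour comes from: both Chrome on the iMac and Safari on the iPad report an audioContext.sampleRate of 44100. So when I downsample both to 16000 in my code, I would expect both to work, yet only the iMac does. It seems the iPad works with a different sample rate internally?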

云镜

2023-03-14

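After a lot of trial and error, I found the problem (and the solution). It seems that "older" iDevice models, such as the 2017 iPad Pro, have a strange quirk: they automatically adjust the microphone sample rate to the rate of the audio being played back. Even though the hardware sample rate of these devices is set to 44.1 kHz, the sample rate changes as soon as some audio is played. This can be observed as follows: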

const audioCtx = new webkitAudioContext();
console.log(`Current sample rate: ${audioCtx.sampleRate}`); // 44100
const audio = new Audio();
audio.src = 'some_audio.mp3';
await audio.play();
console.log(`Current sample rate: ${audioCtx.sampleRate}`); // Sample rate of the played audio

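In my case, I played some synthesized speech from Google Text-to-Speech before opening the speech transcription socket. Those sound files have a sample rate of 24 kHz, which is exactly the sample rate at which Google Cloud received my audio input.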
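To make the mismatch concrete, here is a small arithmetic sketch. It assumes the microphone really delivers 24 kHz while the downsampler is still told the input is 44.1 kHz; `effectiveRate` is a hypothetical helper for illustration, not part of the original code.

```javascript
// Hypothetical helper: the sample rate the audio effectively has after a
// downsampler decimates it by (assumedRate / targetRate), although the
// input was actually captured at actualRate.
function effectiveRate(actualRate, assumedRate, targetRate) {
  return (actualRate * targetRate) / assumedRate;
}

// Assumption: the mic runs at 24 kHz, the code assumes 44.1 kHz and targets 16 kHz.
const rate = effectiveRate(24000, 44100, 16000);
console.log(Math.round(rate)); // 8707

// Labelling that audio as 16 kHz plays it back too fast by this factor,
// which would explain the high-pitched "chipmunk" recording.
console.log((44100 / 24000).toFixed(2)); // 1.84
```

The result of roughly 8.7 kHz is in the same ballpark as the 9500 to 10000 that was found by trial and error in the question, and fewer samples per second also fits the observation that the data arrived more slowly.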

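So the solution was (something I should have done anyway) to downsample everything to 16 kHz (see the helper function in the question), but not from a hard-coded 44.1 kHz; instead, from the current sample rate of the audio context. So I changed my microphoneProcess() function as follows: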

const left = e.inputBuffer.getChannelData(0);
const left16 = Helpers.downsampleBuffer(left, this.audioCtx.sampleRate, 16000);
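For readers who want a standalone version, the following is a minimal sketch of an averaging downsampler that takes the input rate as a parameter instead of hard-coding it. The name `downsampleToInt16` is made up for this example; it is not the helper from the question, and it returns an `Int16Array` rather than the underlying `ArrayBuffer`.

```javascript
// Sketch (hypothetical helper): average-based downsampling of Float32 samples
// in [-1, 1] to 16-bit PCM, with the input rate supplied by the caller.
function downsampleToInt16(input, inputRate, outputRate) {
  if (outputRate > inputRate) {
    throw new Error('Output rate must not exceed input rate');
  }
  const ratio = inputRate / outputRate;
  const out = new Int16Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    // Input window that maps onto output sample i (at least one sample wide)
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.max(start + 1, Math.floor((i + 1) * ratio)));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    // Clamp on both sides before scaling to the 16-bit range
    const avg = Math.max(-1, Math.min(1, sum / (end - start)));
    out[i] = Math.round(avg * 0x7fff);
  }
  return out;
}

// Example with made-up samples: 48 kHz in, 16 kHz out (ratio 3)
const pcm = downsampleToInt16(new Float32Array([1, 1, 1, -1, -1, -1]), 48000, 16000);
console.log(Array.from(pcm)); // [ 32767, -32767 ]
```

Inside an `onaudioprocess` handler, the rate could be taken from `this.audioCtx.sampleRate` as in the answer, or from `e.inputBuffer.sampleRate`, which is the rate of the buffer actually being processed.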

Conclusion: do not trust the sample rate Safari reports at page load. It may change.
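As a debugging aid, a tiny watcher can flag the moment the reported rate changes between audio callbacks. This is only a sketch: `createRateWatcher` is a hypothetical helper, and how it is wired into the audio callback is an assumption.

```javascript
// Hypothetical helper: calls onChange whenever the observed rate differs
// from the one seen on the previous call, and returns the current rate.
function createRateWatcher(onChange) {
  let lastRate = null;
  return function check(rate) {
    if (lastRate !== null && rate !== lastRate) {
      onChange(lastRate, rate);
    }
    lastRate = rate;
    return rate;
  };
}

// Simulated sequence of rates as they might be seen across callbacks
const changes = [];
const checkRate = createRateWatcher((from, to) => changes.push(`${from} -> ${to}`));
[44100, 44100, 24000, 24000].forEach((r) => checkRate(r));
console.log(changes); // [ '44100 -> 24000' ]
```

In the browser, one option would be to call `checkRate(this.audioCtx.sampleRate)` at the top of the `onaudioprocess` handler and log any change.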
