Android iFlytek Speech Recognition: Enabling Dynamic Correction over the WebAPI

韩经武
2023-12-01

Official documentation: https://www.xfyun.cn/doc/asr/voicedictation/API.html#%E6%8E%A5%E5%8F%A3%E8%B0%83%E7%94%A8%E6%B5%81%E7%A8%8B

 

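The iFlytek WebAPI streaming voice dictation interface converts speech of up to one minute into text in real time. It returns recognition results while the audio is still being uploaded, so text arrives as you send audio. Enabling dynamic correction improves recognition accuracy. The official console has an online test page at https://www.xfyun.cn/services/voicedictation. Comparing the output of that page with results obtained without dynamic correction shows that, without it, the recognized text differs considerably from what the page produces. You can also clearly see text that the page has already displayed being revised afterwards, so the page evidently has dynamic correction enabled.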

 

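In the code, the dynamic correction parameter (dwa=wpgs) is added to the first frame sent after a successful handshake. The same audio was then spoken to both the console and the app so that the two recognition results could be compared: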

fun firstFrame(
    audio: String,
    @LanguageCode
    language: String
): RecognizeRequest {
    return RecognizeRequest(
        common = Common(APP_ID),
        business = Business(
            language = when (language) {
                LanguageCode.JP -> LANGUAGE_JP
                LanguageCode.EN -> LANGUAGE_EN
                LanguageCode.CN -> LANGUAGE_CN
                else -> error("Unsupported language.")
            },
            dwa = "wpgs"
        ),
        data = RequestData(
            status = STATUS_FIRST,
            audio = audio
        )
    )
}
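
For readers without the project's RecognizeRequest classes, here is a minimal sketch of what the first frame might look like on the wire. The helper name buildFirstFrameJson is hypothetical, and the values for language, domain, accent, format and encoding are assumptions based on the official documentation linked above, so check them against your own configuration. The appId is inserted without JSON escaping, which is only acceptable for a plain alphanumeric ID.

```kotlin
import java.util.Base64

// Hypothetical helper (not from the original project): builds the first-frame
// JSON by hand so the position of "dwa":"wpgs" in the payload is visible.
// Field values other than dwa are assumptions taken from the official docs.
fun buildFirstFrameJson(
    appId: String,
    pcmChunk: ByteArray,
    language: String = "zh_cn"
): String {
    // The audio chunk is sent as a Base64 string.
    val audio = Base64.getEncoder().encodeToString(pcmChunk)
    return """{"common":{"app_id":"$appId"},""" +
        """"business":{"language":"$language","domain":"iat","accent":"mandarin","dwa":"wpgs"},""" +
        """"data":{"status":0,"format":"audio/L16;rate=16000","encoding":"raw","audio":"$audio"}}"""
}
```

Only the first frame carries the common and business objects in this sketch; dwa sits inside business, matching the Business(...) constructor call above.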

 

Original source text: 语音分析,自然语言,处理内容审核图像识别,人脸识别,文字识别,语音硬件,医疗服务,基础服务。

Console output: 语音分析,自然语言处理内容审核图像识别,人脸识别,文字识别,语音硬件,医疗服务,技术服务。

 

Recognition responses printed by the code:

2021-03-01 15:56:16.026 21006-21385/co.logre V/dynamicResult: dynamicResult: 语音

2021-03-01 15:56:16.478 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音分析

2021-03-01 15:56:17.123 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音分析自然

2021-03-01 15:56:17.281 21006-21108/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言

2021-03-01 15:56:17.450 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理

2021-03-01 15:56:18.085 21006-21394/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀

2021-03-01 15:56:18.404 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容

2021-03-01 15:56:18.568 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容是

2021-03-01 15:56:18.879 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核

2021-03-01 15:56:19.521 21006-21107/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像

2021-03-01 15:56:19.839 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别

2021-03-01 15:56:20.526 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸

2021-03-01 15:56:20.640 21006-21410/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸识别

2021-03-01 15:56:21.440 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸识别文字

2021-03-01 15:56:21.602 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸识别文字的

2021-03-01 15:56:22.562 21006-21386/co.logre V/dynamicResult: dynamicResult: 语音分析,自然语言处理,内容审核,图像识别,人脸识别文字识别

2021-03-01 15:56:22.901 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音

2021-03-01 15:56:23.060 21006-21404/co.logre V/dynamicResult: dynamicResult: 语音的

2021-03-01 15:56:23.396 21006-21404/co.logre V/dynamicResult: dynamicResult: 语音

2021-03-01 15:56:23.545 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音邮件

2021-03-01 15:56:24.029 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要

2021-03-01 15:56:24.187 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要服务员

2021-03-01 15:56:24.506 21006-21386/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要扶

2021-03-01 15:56:24.972 21006-21405/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要扶自己祝福

2021-03-01 15:56:25.915 21006-21109/co.logre V/dynamicResult: dynamicResult: ,语音硬件医疗服务及服务

2021-03-01 15:56:31.698 21006-21387/co.logre V/dynamicResult: dynamicResult:

 

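The logs show that dynamic correction works at a finer granularity, and that the text is refined into a more accurate transcription as recognition proceeds. The concrete JSON values are listed below for closer analysis. (PS: parsing reuses the DTOs written for the vad version of the JSON, i.e. the non-dynamic result. Compared with the non-dynamic JSON, the dynamic correction JSON has two extra fields, "pgs" and "rg", and lacks "vad". These two extra fields are not needed for the approach used here, so their absence from the JSON below does not affect the result.)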

package co.logre.service.dto

import android.util.Log
import com.squareup.moshi.Json
import com.squareup.moshi.JsonClass

@JsonClass(generateAdapter = true)
data class RecognizeResponse(
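    // Session ID; returned only for the first frame request after a successful handshake
    @Json(name = "sid") val sid: String?,
    // Return code: 0 means success, anything else is an error
    @Json(name = "code") val code: Int,
    // Error description
    @Json(name = "message") val message: String?,
    // Dictation result
    @Json(name = "data") val data: ResponseData?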
)

@JsonClass(generateAdapter = true)
data class ResponseData(
    // Flag indicating whether the recognition result has finished
    @Json(name = "status") val status: Int,
    // Dictation recognition result
    @Json(name = "result") val result: ResponseResult?
)

fun ResponseData?.linkResults(): String {
    this ?: return "<NoData>"
    result ?: return "<NoResult>"
    return result.words.flatMap { it.unitList }.joinToString(separator = "") { it.text }
}

@JsonClass(generateAdapter = true)
data class ResponseResult(
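    // Starting endpoint frame offset
    @Json(name = "bg") val beginInFrame: Int,
    // Ending endpoint frame offset
    @Json(name = "ed") val endInFrame: Int,
    // Sequence number of the returned result
    @Json(name = "sn") val sequenceNumber: Int,
    // Whether this is the last result segment
    @Json(name = "ls") val lastSection: Boolean,
    // Dictation result
    @Json(name = "ws") val words: List<Word>,
    // VAD info; only present when vinfo = 1
    @Json(name = "vad") val vad: VadResult?,
    // Only present when dwa=wpgs is enabled, so it is nullable
    @Json(name = "pgs") val pgs: String?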
)

fun ResponseData.vadInfo(): VadInfo? {
    val result = this.result?.vad?.results?.singleOrNull()
    if (result == null) {
        Log.w("RecognizeResponse", "vinfo not set.")
    }
    return result
}

@JsonClass(generateAdapter = true)
data class VadResult(
    @Json(name = "ws") val results: List<VadInfo>
)

@JsonClass(generateAdapter = true)
data class VadInfo(
    // Starting endpoint frame offset
    @Json(name = "bg") val beginInFrame: Int,
    // Ending endpoint frame offset
    @Json(name = "ed") val endInFrame: Int
) {
    companion object {
        const val MILLIS_PER_FRAME = 10
    }

    val beginMillis get() = beginInFrame.toLong().times(MILLIS_PER_FRAME)
    val endMillis get() = endInFrame.toLong().times(MILLIS_PER_FRAME)
}

@JsonClass(generateAdapter = true)
data class Word(
    // Starting endpoint frame offset
    @Json(name = "bg") val offset: Int,
    // Chinese word segmentation
    @Json(name = "cw") val unitList: List<WordUnit>
)

@JsonClass(generateAdapter = true)
data class WordUnit(
    // Word or character
    @Json(name = "w") val text: String
)

 

2021-03-01 15:56:16.026 21006-21385/co.logre V/dynamicResult: dynamicResult: 语音

{"result":{"lastSection":false,"sequenceNumber":1,"words":[{"offset":0,"unitList":[{"text":"语音"}]}]},"status":0}

2021-03-01 15:56:16.478 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音分析

{"result":{"lastSection":false,"sequenceNumber":2,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]}]},"status":1}

2021-03-01 15:56:17.123 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音分析自然

{"result":{"lastSection":false,"sequenceNumber":3,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]},{"offset":0,"unitList":[{"text":"自然"}]}]},"status":1}

2021-03-01 15:56:17.281 21006-21108/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言

{"result":{"lastSection":false,"sequenceNumber":4,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]},{"offset":0,"unitList":[{"text":"自然"}]},{"offset":0,"unitList":[{"text":"语言"}]}]},"status":1}

2021-03-01 15:56:17.450 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理

{"result":{"lastSection":false,"sequenceNumber":5,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]},{"offset":0,"unitList":[{"text":"自然"}]},{"offset":0,"unitList":[{"text":"语言"}]},{"offset":0,"unitList":[{"text":"处理"}]}]},"status":1}

2021-03-01 15:56:18.085 21006-21394/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀

{"result":{"lastSection":false,"sequenceNumber":6,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]},{"offset":0,"unitList":[{"text":"自然"}]},{"offset":0,"unitList":[{"text":"语言"}]},{"offset":0,"unitList":[{"text":"处理"}]},{"offset":0,"unitList":[{"text":"呀"}]}]},"status":1}

2021-03-01 15:56:18.404 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容

{"result":{"lastSection":false,"sequenceNumber":7,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]},{"offset":0,"unitList":[{"text":"自然"}]},{"offset":0,"unitList":[{"text":"语言"}]},{"offset":0,"unitList":[{"text":"处理"}]},{"offset":0,"unitList":[{"text":"呀"}]},{"offset":0,"unitList":[{"text":"内容"}]}]},"status":1}

2021-03-01 15:56:18.568 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容是

2021-03-01 15:56:18.879 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核

2021-03-01 15:56:19.521 21006-21107/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像

2021-03-01 15:56:19.839 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别

2021-03-01 15:56:20.526 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸

2021-03-01 15:56:20.640 21006-21410/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸识别

{"result":{"lastSection":false,"sequenceNumber":11,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"分析"}]},{"offset":0,"unitList":[{"text":"自然"}]},{"offset":0,"unitList":[{"text":"语言"}]},{"offset":0,"unitList":[{"text":"处理"}]},{"offset":0,"unitList":[{"text":"呀"}]},{"offset":0,"unitList":[{"text":"内容"}]},{"offset":0,"unitList":[{"text":"审核"}]},{"offset":0,"unitList":[{"text":"图像"}]},{"offset":0,"unitList":[{"text":"识别"}]}]},"status":1}

2021-03-01 15:56:21.440 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸识别文字

2021-03-01 15:56:21.602 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音分析自然语言处理呀内容审核图像识别人脸识别文字的

2021-03-01 15:56:22.562 21006-21386/co.logre V/dynamicResult: dynamicResult: 语音分析,自然语言处理,内容审核,图像识别,人脸识别文字识别

{"result":{"lastSection":false,"sequenceNumber":16,"words":[{"offset":126,"unitList":[{"text":"语音"}]},{"offset":170,"unitList":[{"text":"分析"}]},{"offset":246,"unitList":[{"text":","}]},{"offset":246,"unitList":[{"text":"自然"}]},{"offset":286,"unitList":[{"text":"语言"}]},{"offset":318,"unitList":[{"text":"处理"}]},{"offset":386,"unitList":[{"text":","}]},{"offset":386,"unitList":[{"text":"内容"}]},{"offset":430,"unitList":[{"text":"审核"}]},{"offset":498,"unitList":[{"text":","}]},{"offset":498,"unitList":[{"text":"图像"}]},{"offset":546,"unitList":[{"text":"识别"}]},{"offset":602,"unitList":[{"text":","}]},{"offset":602,"unitList":[{"text":"人脸识别"}]},{"offset":694,"unitList":[{"text":"文字"}]},{"offset":734,"unitList":[{"text":"识别"}]}]},"status":1}

2021-03-01 15:56:22.901 21006-21396/co.logre V/dynamicResult: dynamicResult: 语音

2021-03-01 15:56:23.060 21006-21404/co.logre V/dynamicResult: dynamicResult: 语音的

{"result":{"lastSection":false,"sequenceNumber":18,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"的"}]}]},"status":1}

2021-03-01 15:56:23.396 21006-21404/co.logre V/dynamicResult: dynamicResult: 语音

{"result":{"lastSection":false,"sequenceNumber":19,"words":[{"offset":0,"unitList":[{"text":"语音"}]}]},"status":1}

2021-03-01 15:56:23.545 21006-21387/co.logre V/dynamicResult: dynamicResult: 语音邮件

{"result":{"lastSection":false,"sequenceNumber":20,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"邮件"}]}]},"status":1}

2021-03-01 15:56:24.029 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要

{"result":{"lastSection":false,"sequenceNumber":21,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"邮件"}]},{"offset":0,"unitList":[{"text":"你"}]},{"offset":0,"unitList":[{"text":"不要"}]}]},"status":1}

2021-03-01 15:56:24.187 21006-21109/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要服务员

{"result":{"lastSection":false,"sequenceNumber":22,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"邮件"}]},{"offset":0,"unitList":[{"text":"你"}]},{"offset":0,"unitList":[{"text":"不要"}]},{"offset":0,"unitList":[{"text":"服务员"}]}]},"status":1}

2021-03-01 15:56:24.506 21006-21386/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要扶

{"result":{"lastSection":false,"sequenceNumber":23,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"邮件"}]},{"offset":0,"unitList":[{"text":"你"}]},{"offset":0,"unitList":[{"text":"不要"}]},{"offset":0,"unitList":[{"text":"扶"}]}]},"status":1}

2021-03-01 15:56:24.972 21006-21405/co.logre V/dynamicResult: dynamicResult: 语音邮件你不要扶自己祝福

{"result":{"lastSection":false,"sequenceNumber":24,"words":[{"offset":0,"unitList":[{"text":"语音"}]},{"offset":0,"unitList":[{"text":"邮件"}]},{"offset":0,"unitList":[{"text":"你"}]},{"offset":0,"unitList":[{"text":"不要"}]},{"offset":0,"unitList":[{"text":"扶"}]},{"offset":0,"unitList":[{"text":"自己"}]},{"offset":0,"unitList":[{"text":"祝福"}]}]},"status":1}

2021-03-01 15:56:25.915 21006-21109/co.logre V/dynamicResult: dynamicResult: ,语音硬件医疗服务及服务

{"result":{"lastSection":false,"sequenceNumber":25,"words":[{"offset":834,"unitList":[{"text":","}]},{"offset":834,"unitList":[{"text":"语音"}]},{"offset":874,"unitList":[{"text":"硬件"}]},{"offset":914,"unitList":[{"text":"医疗"}]},{"offset":970,"unitList":[{"text":"服务"}]},{"offset":1030,"unitList":[{"text":"及"}]},{"offset":1066,"unitList":[{"text":"服务"}]}]},"status":1}

2021-03-01 15:56:27.037 21006-21107/co.logre V/dynamicResult: dynamicResult: 。

{"result":{"lastSection":true,"sequenceNumber":26,"words":[{"offset":0,"unitList":[{"text":"。"}]}]},"status":2}

2021-03-01 15:56:31.698 21006-21387/co.logre V/dynamicResult: dynamicResult:

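As the official documentation states:

data.result.ws.bg (int): starting endpoint frame offset, in frames (1 frame = 10 ms)
Note: in the following two cases bg = 0 and carries no meaning:
1) the returned result is punctuation or empty; 2) the returned result is too long.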

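As the logs show, the results that are most accurate after dynamic correction are the ones whose "offset" value is non-zero (i.e. data.result.ws.bg != 0), so it is enough to filter for the results with a non-zero offset and assemble them.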

Source code:

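// Only log results whose first word carries a real frame offset (bg != 0).
// firstOrNull() avoids the crash that `!!` caused when result was null.
val firstWord = responseData.result?.words?.firstOrNull()
if (firstWord != null && firstWord.offset != 0) {
    Log.v("responseData.result", "responseData: " + responseData.linkResults())
}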

 

Data after filtering:

2021-03-01 17:03:05.188 23915-24001/co.logre V/responseData.result: responseData: 语音分析,自然语言处理,内容审核,图像识别,人脸识别,文字识别

2021-03-01 17:03:08.702 23915-24001/co.logre V/responseData.result: responseData: ,语音硬件医疗服务协助服务
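
To stitch the filtered results into one transcript, something along these lines could work. It is a sketch only: the data classes are simplified stand-ins for the DTOs in this post (no Moshi annotations), and collectCorrectedText is a hypothetical helper name.

```kotlin
// Simplified stand-ins for the article's DTOs, for illustration only.
data class WordUnit(val text: String)
data class Word(val offset: Int, val unitList: List<WordUnit>)
data class ResponseResult(val words: List<Word>)

// Joins every word unit of one result into a single string,
// in the same way as linkResults() earlier in the post.
fun ResponseResult.text(): String =
    words.flatMap { it.unitList }.joinToString(separator = "") { it.text }

// Hypothetical helper: keeps only results whose first word has a non-zero
// offset (bg != 0) and concatenates their text in arrival order.
fun collectCorrectedText(results: List<ResponseResult>): String =
    results
        .filter { result ->
            val first = result.words.firstOrNull()
            first != null && first.offset != 0
        }
        .joinToString(separator = "") { it.text() }
```

One consequence worth noting: in the logs above, the closing "。" arrives with offset 0, so a filter like this drops it, just as it is missing from the filtered output.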

----------------------

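If you need the duration of each spoken sentence, the official documentation describes the following:

vinfo response parameters (these apply to non-dynamic mode; the official excerpt quoted above applies to dynamic mode.)

If vinfo=1 is set, the following fields are also returned (if dwa=wpgs is also enabled and set, vinfo has no effect):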

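Parameter (type): description
data.result.vad (object): endpoint frame offset information
data.result.vad.ws (array): endpoint frame offset results
data.result.vad.bg (int): starting endpoint frame offset, in frames (1 frame = 10 ms)
data.result.vad.ed (int): ending endpoint frame offset, in frames (1 frame = 10 ms)
data.result.vad.eg (number): can be ignored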

 

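So:

In non-dynamic mode, with vinfo=1 sent:

Subtract bg from ed, both of which are available directly in vad, to get the frame difference, then convert it to the speech duration in milliseconds (1 frame = 10 ms).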
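
As a small sketch of that arithmetic (vadDurationMillis is a hypothetical helper, equivalent to subtracting beginMillis from endMillis on the VadInfo class shown earlier):

```kotlin
// 1 frame = 10 ms, per the official documentation quoted above.
const val MILLIS_PER_FRAME = 10L

// Hypothetical helper: duration of one VAD segment from its begin/end frame offsets.
fun vadDurationMillis(bg: Int, ed: Int): Long {
    require(ed >= bg) { "ed must not be smaller than bg" }
    return (ed - bg) * MILLIS_PER_FRAME
}
```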

 

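In dynamic correction mode:

Use data.result.ws.bg of the first word in each sentence as the start frame and data.result.ws.bg of the last word as the end frame. The frame difference then gives the speech duration in milliseconds (1 frame = 10 ms).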
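
A rough sketch of that calculation, assuming the bg offsets of one corrected sentence have already been collected into a list (sentenceDurationMillis is a hypothetical helper name). Since the last word's bg marks where that word begins, the figure appears to leave out the length of the last word itself, so treat it as an approximation.

```kotlin
// Hypothetical helper: approximates a sentence's duration from the bg offsets
// (in frames) of its words, using first and last only. 1 frame = 10 ms.
fun sentenceDurationMillis(wordOffsets: List<Int>): Long {
    // With fewer than two offsets there is no frame difference to measure.
    if (wordOffsets.size < 2) return 0L
    return (wordOffsets.last() - wordOffsets.first()) * 10L
}
```

For example, the corrected result with sequenceNumber 16 in the logs starts at offset 126 and its last word starts at offset 734, which gives (734 - 126) * 10 = 6080 ms.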

 

 
