Whither Speech Recognition: 25 Years and Another 25 Years

孙宏壮
2023-12-01

Pierce’s harsh criticism

In deception, studied and artful deceit is apt to succeed better and more quickly than science. Indeed, a wag has proposed that computers are becoming so nearly human that they can act without thinking. That must not be true! Humans are there for the more important work of interpretation & explanation & summarizing what they learn as insights.

Spoken English is, in general, simply not recognizable phoneme by phoneme or word by word; people recognize utterances, not because they hear the phonetic features or the words distinctly, but because they have a general sense of what a conversation is about and are able to guess what has been said. He is somewhat obsessed with the fantasy of human-level intelligence, while voices on the opposite side stressed what good the technology could do for consumers.

When we listen to a person speaking or read a page of print, much of what we think we see or hear is supplied from our memory. When we go to a foreign theatre, what troubles us is not so much that we cannot understand what the actors say as that we cannot hear their words. The fact is that we hear quite as little under similar conditions at home; only our mind, being fuller of English verbal associations, supplies the requisite material for comprehension upon a much slighter auditory hint. — [William James, 1899]
A native speaker can understand a conversation on a noisy street-car where a foreigner very fluent in the language cannot.

These considerations lead us to believe that a general phonetic typewriter is simply impossible unless the typewriter has an intelligence and a knowledge of language comparable to those of a native speaker. Performance will continue to be very limited unless the recognizing device understands what is being said.

Most recognizers (he meant the people who develop speech recognition systems, not the systems themselves) behave, not like scientists, but like mad inventors or untrustworthy engineers. The typical recognizer gets it into his head that he can solve “the problem”. The basis for this is either individual inspiration (the “mad inventor” source of knowledge) or acceptance of untested rules, schemes, or information (the untrustworthy engineer approach).


Anti-Pierce: a more pragmatic & objective view of where we are now

This article's stance on ASR is actually very similar to the stance that [Church & Hovy 1993 Good Applications for Crummy MT] takes on MT: pragmatic, cautious, humble. Modesty, Modesty, Modesty. It emphasizes incremental work, but with the patience to let many small steps add up to a thousand-mile journey.

(Worth reciting in full): Pierce’s influential article was successful in curtailing, but not stopping, speech recognition research. In 25 years, speech recognition has evolved from a “futile” endeavor to commercial reality. … Speech recognition will proceed incrementally, but inevitably, forward. Applications will be deployed — applications that seem simple, but that prove beneficial to society.


Dimensions of Speech Recognition

Humans are able to understand speech so easily that we often fail to appreciate the difficulties that this task poses for machines.

Here are some of the dimensions in which machine performance falls short.

  • Degree of speaker independence
  • Vocabulary complexity
  • Speaking rate, coarticulation
  • Speaker variability (e.g. in loudness, speed, stress, extraneous coughs or 'um's)
  • Channel conditions

In the pattern matching philosophy, there are three stages of ASR: speech feature analysis, pattern classification, and language processing.
  1. Speech Feature Analysis

Current models are oversimplified and represent only the superficial aspects of the physiology of hearing and speech recognition. Future improvements in speech and hearing models should pay off directly in higher acoustic discrimination power.
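To make “speech feature analysis” concrete, here is a minimal sketch of the classic front end: frame the waveform, apply a window, take the power spectrum, and pass it through a log mel filterbank. The function names and parameter choices (16 kHz sampling, 25 ms frames, 26 filters) are my own illustrative defaults, not details from the article.

```python
# Minimal sketch of a classic ASR front end: short-time framing, Hamming
# window, power spectrum, and a log mel filterbank.
import numpy as np

def mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=26, n_fft=512):
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)
    # Triangular mel filters spaced evenly on the mel scale up to Nyquist.
    edges = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats.append(np.log(fbank @ power + 1e-10))
    return np.array(feats)  # shape: (n_frames, n_filters)
```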

  2. Pattern Classification

Because the rate of speaking may vary, a dynamic time warping technique is used to stretch or shrink the time axis to minimize the distortion relative to the template. It is hard to provide error recovery if an incorrect conclusion is drawn at an early decision point.
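A minimal sketch of that dynamic time warping idea, assuming Euclidean frame distances and a simple three-way step pattern (both are illustrative choices, not specifics from the article):

```python
# Minimal dynamic-time-warping sketch: align a test feature sequence to a
# stored template by stretching/shrinking the time axis so that the total
# frame-to-frame distortion is minimized. Returns the alignment cost.
import numpy as np

def dtw_cost(test, template):
    T, R = len(test), len(template)
    D = np.full((T + 1, R + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, R + 1):
            local = np.linalg.norm(test[i - 1] - template[j - 1])  # frame distance
            # Allowed moves: match, skip a template frame, skip a test frame.
            D[i, j] = local + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T, R]

# Usage sketch (hypothetical names): pick the word whose template warps most cheaply.
# best_word = min(templates, key=lambda w: dtw_cost(utterance_feats, templates[w]))
```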

The differences between the rule-based and template approaches resulted in a philosophical split in the community until the early 1980s, when both approaches were surpassed by a more powerful theory, the HMM. The principal advantage of HMM systems is that HMMs retain more statistical information about the complete distribution of features present in the training data. This translates to greater discrimination power. It has proven difficult for neural networks to achieve the level of time alignment of the speech signal that HMMs have attained. So, neural networks are most often used today as static discriminators in systems based on an HMM statistical framework.
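To show what the HMM's time alignment buys, here is a minimal Viterbi sketch that recovers the best state sequence for a sequence of frames. The interface (a precomputed matrix of per-frame state log-likelihoods) is an illustrative assumption, not the article's formulation.

```python
# Minimal Viterbi sketch for a small HMM: given per-frame log-likelihoods of
# each state, find the most probable state sequence, i.e. the time alignment.
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, S) frame log-likelihoods; log_trans: (S, S); log_init: (S,)."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)   # best score ending in state s at time t
    back = np.zeros((T, S), dtype=int)  # backpointers for traceback
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] + log_trans[:, s]
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + log_emit[t, s]
    # Trace back the best path from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))
```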

  3. Language Processing

Of the three stages, LM is the weak link at this time. The difficulties of pattern classification and feature analysis have been solved at least partly by “brute force” techniques aided by fast, powerful computers. But the technology of language modelling is still in its infancy.
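For a concrete sense of what “language modelling” looked like at the time, here is a minimal bigram model with add-one smoothing that could rescore recognizer hypotheses; the smoothing choice and interface are illustrative assumptions on my part.

```python
# Minimal bigram language model sketch with add-one smoothing.
from collections import Counter
import math

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    V = len(unigrams)
    lp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing so unseen bigrams do not zero out a hypothesis.
        lp += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + V))
    return lp

# Usage sketch: rescore competing recognizer hypotheses and keep the most probable.
```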


Keep our heads down: current applications are still humble
  1. Telecommunications
    Field trials have shown that the ability to spot the key words in speech is a prerequisite for most telephone network applications. Ease of use is the key. The trials were considered successful not just from a technology point of view, but also because customers were willing to use the service.

  2. Voice Dictation
    Just as typing is a skill that must be practiced, dictating by voice with these systems requires an initial effort. The prospective user must first train the system to his or her voice. The word-error rate depends on the skill of the speaker and the similarity of the text to the LM (a minimal word-error-rate sketch appears after this list).

  3. Speech Understanding for Data Retrieval
    Two aims: (1) large-vocabulary, continuous speech recognition and (2) interactive problem-solving. Both efforts aim to provide real-time, speaker-independent or speaker-adaptive technology that can handle spontaneous speech. In 1991, spoken-language understanding research began with the collection of spontaneous queries about air travel. Several groups have built on-line demos of this task that run in real time on a workstation. Although not yet ready for field applications, the use of ASR in highly constrained tasks will be within the range of practicability within the next few years.
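As promised in the Voice Dictation item above, here is a minimal sketch of the word-error-rate metric dictation systems are judged by: word-level edit distance (substitutions, insertions, deletions) divided by the reference length. The function name and toy example are mine, for illustration.

```python
# Minimal word-error-rate sketch: WER = (subs + ins + dels) / reference length,
# computed via word-level edit distance.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# print(word_error_rate("recognize speech", "wreck a nice beach"))  # -> 2.0
```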


Prognosis (a forecast of the likely outcome of a situation)

It has been observed that predictions of future technologies tend to be overly optimistic for the short term and overly pessimistic for the long haul. Such forecasts can have the unfortunate effect of creating unrealistic expectations, followed by premature abandonment of the effort.

Predicting 25 years into the future may be futile because it is impossible to predict when a revolution will occur. There are still engineering improvements that can be built on today’s science. Speech recognition puts the burden on the machine to accommodate human skill in speaking and listening, rather than imposing on a person the need to communicate in a way convenient to the machine (though people will learn to modify their speech habits to use speech recognition devices).


The above motivations will continue to be powerful stimulants for further research; some specific predictions are listed below:

  • Topic-specific, speaker-specific systems
  • Major advances will be made in language modelling
  • Voice-command
  • Restricted-domain applications for which restricted semantic info is available. (Unrestricted domains will remain a formidable challenge and will not be successfully deployed until fundamental advances are made in our understanding of the structure of spoken language. Stop kicking out linguists! Stop blindly riding the tide without being aware of what is supporting it at the base!)