What is Automatic Speech Recognition?
Automatic speech recognition (ASR) can be defined as the independent,
computer‐driven transcription of spoken language into readable text in real
time. In a nutshell, ASR is technology that allows a computer to identify the
words that a person speaks into a microphone or telephone and convert them to
written text.
The goal of having a machine understand fluently spoken speech has driven
speech research for more than 50 years. Although ASR technology is not yet at
the point where machines understand all speech, in any acoustic environment,
by any person, it is used on a day‐to‐day basis in a number of applications
and services.
The ultimate goal of ASR research is to allow a computer to recognize, in real
time and with 100% accuracy, all words that are intelligibly spoken by any
person, independent of vocabulary size, noise, speaker characteristics or
accent. Today, if the system is trained to learn an individual speaker's
voice, then much larger vocabularies are possible and accuracy can be greater
than 90%.
Commercially available ASR systems usually require only a short period of
speaker training and may successfully capture continuous speech with a large
vocabulary at a normal pace with very high accuracy. Most commercial vendors
claim that their recognition software can achieve between 98% and 99% accuracy
if operated under optimal conditions. "Optimal conditions" usually assume that
users have speech characteristics which match the training data, can achieve
proper speaker adaptation, and work in a low-noise environment.
This explains why some users, especially those whose
speech is heavily accented, might achieve recognition rates much lower than
expected.
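The accuracy figures above are easier to interpret with a concrete metric in hand. The text does not say how vendors measure accuracy; one common convention, assumed here, is word accuracy defined as 1 minus the word error rate (WER), where WER counts the substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of reference words. The function names below are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was missed
                curr[j - 1] + 1,     # insertion: extra word was recognized
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed convention: accuracy = 1 - WER (can go below 0 with many insertions)."""
    return 1.0 - word_error_rate(reference, hypothesis)


# One substituted word out of six reference words.
print(f"{word_accuracy('the cat sat on the mat', 'the cat sat on a mat'):.3f}")  # 0.833
```

Under this definition, a claimed 98% accuracy corresponds to roughly two word errors per hundred words spoken, and a single misrecognized word in a six-word sentence already drops the sentence to about 83%.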
How Does ASR Work?
The goal of an ASR system is to accurately and efficiently convert a speech
signal into a text transcription of the spoken words, independent of the
speaker, the environment or the device used to record the speech (i.e., the
microphone).
This process begins when a speaker decides what to say and actually speaks a
sentence. (This is a sequence of words, possibly with pauses, "uh"s and
"um"s.) The result is a speech waveform, which embodies the words of the
sentence as well as the extraneous sounds and pauses in the spoken input.
Next, the software attempts to decode the speech into the best estimate of the
sentence. First, it converts the speech signal into a sequence of vectors
which are measured throughout the duration of the speech signal. Then, using a
syntactic decoder, it generates a valid sequence of representations.
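The first decoding step, turning the speech signal into a sequence of vectors, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product works: the text does not name the features used, so the sketch computes two simple stand-ins per frame (log energy and zero-crossing rate), whereas real systems typically use richer spectral features. The 25 ms frame length, 10 ms step, 16 kHz sample rate and all function names are assumptions chosen for the example.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def frame_features(frame):
    """Return a small feature vector [log energy, zero-crossing rate] for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zero_crossing_rate = crossings / (len(frame) - 1)
    return [log_energy, zero_crossing_rate]


def extract_features(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Convert a waveform into one feature vector per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [frame_features(f) for f in frame_signal(samples, frame_len, hop)]


# Synthetic input: half a second of silence followed by a 440 Hz tone.
rate = 16000
silence = [0.0] * (rate // 2)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
features = extract_features(silence + tone, sample_rate=rate)
print(len(features), "frames,", len(features[0]), "features per frame")
```

Each frame yields one vector, so a one-second recording becomes a sequence of roughly a hundred vectors. The decoder then works on that sequence rather than on the raw waveform.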