Patent No. 6,487,531
Signal injection coupling into the human vocal tract for robust audible and inaudible voice recognition (Tosaya et al., Nov. 26, 2002)
Abstract
A means and method are provided for enhancing or replacing the natural excitation of the human vocal tract by artificial excitation means, wherein the artificially created acoustics present additional spectral, temporal, or phase data useful for (1) enhancing the machine recognition robustness of audible speech or (2) enabling more robust machine-recognition of relatively inaudible mouthed or whispered speech. The artificial excitation (a) may be arranged to be audible or inaudible, (b) may be designed to be non-interfering with another user's similar means, (c) may be used in one or both of a vocal content-enhancement mode or a complementary vocal tract-probing mode, and/or (d) may be used for the recognition of audible or inaudible continuous speech or isolated spoken commands.
TECHNICAL FIELD
The present invention is directed generally to voice recognition and, more
particularly, to a means and method for enhancing or replacing the natural excitation
of a living body's vocal tract by artificial excitation means.
BACKGROUND ART
The ability to vocally converse with a computer is a grand and worthy goal of
hundreds of researchers, universities and institutions all over the world. Such
a capability is widely expected to revolutionize communications, learning, commerce,
government services and many other activities by making the complexities of
technology transparent to the user. In order to converse, the computer must
first recognize what words are being said by the human user and then must determine
the likely meaning of those words and formulate meaningful and appropriate ongoing
responses to the user. The invention herein addresses the recognition aspect
of the overall speech understanding problem.
It is well known that the human vocal system can be roughly approximated as
a source driving a digital (or analog) filter; see, e.g., M. Al-Akaidi, "Simulation
model of the vocal tract filter for speech synthesis", Simulation, Vol. 67,
No. 4, pp. 241-246 (October 1996). The source is the larynx and vocal cords
and the filter is the set of resonant acoustic cavities and/or resonant surfaces
created and modified by the many movable portions (articulators) of the throat,
tongue, mouth/throat surfaces, lips and nasal cavity. These include the lips,
mandible, tongue, velum, and pharynx. In essence, the source creates one or both
of a quasi-periodic vibration (voiced sounds) and white noise (unvoiced sounds),
and the many vocal articulators modify that excitation in accordance with the
vowels, consonants or phonemes being expressed. In general, the frequencies
between 600 and 4,000 hertz contain the bulk of the necessary acoustic information
for human speech perception (B. Bergeron, "Using an intraural microphone interface
for improved speech recognition", Collegiate Microcomputer, Vol. 8, No. 3, pp.
231-238 (August 1990)), but there is some human-hearable information all the
way up to 10,000 hertz or so and some important information below 600 hertz.
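By way of illustration (not part of the original disclosure), the following minimal Python sketch renders the source-filter view concrete: a quasi-periodic pulse train standing in for the glottal source drives a cascade of two-pole resonators standing in for the formant filter. The sample rate, pitch, and formant values are illustrative assumptions.

```python
# Minimal source-filter synthesis sketch (illustrative values only).
import numpy as np
from scipy.signal import lfilter

fs = 16000                                 # sample rate, Hz
f0 = 120                                   # pitch of the voiced source, Hz
formants = [(700, 80), (1200, 100)]        # (center frequency, bandwidth), Hz

# Source: impulse train at the pitch period, a crude stand-in for glottal pulses.
source = np.zeros(fs)                      # one second of signal
source[::fs // f0] = 1.0

# Filter: cascade of two-pole resonators, one per formant.
signal = source
for fc, bw in formants:
    r = np.exp(-np.pi * bw / fs)           # pole radius set by bandwidth
    theta = 2 * np.pi * fc / fs            # pole angle set by center frequency
    signal = lfilter([1.0], [1.0, -2 * r * np.cos(theta), r * r], signal)

# 'signal' now approximates a sustained, vowel-like voiced sound; replacing the
# impulse train with white noise models unvoiced (aspirated) excitation.
```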
The variable resonances of the human vocal tract are referred to as formants
and are denoted F1, F2, and so on. In general, the lower-frequency formants
F1 and F2 are usually in the range of 250 to 3,000 hertz and contain a major
portion of human-hearable information about many articulated sounds and phonemes.
Although the formants are principal features of human speech, they are far from
the only features, and even the formants themselves dynamically change frequency
and amplitude depending on context, speaking rate, and mood. Indeed, only experts
have been able to manually determine what a person has said from a printout
of the spectrogram of an utterance, and even this analysis involves best guesses.
Thus, automated speech recognition is one of the grand problems of the linguistic
and speech sciences. In fact, only the recent application of trainable stochastic
(statistics-based) models on fast microprocessors (e.g., 200 MHz or higher)
has resulted in 1998's introduction of inexpensive continuous speech (CS) software
products. In the stochastic models used in such software, referred to as Hidden
Markov Models (HMMs), the statistics of varying enunciation and temporal delivery
are captured in oral training sessions and made available as models
for the internal search engine(s).
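For illustration only, the toy sketch below performs the Viterbi decode at the heart of such an HMM search: given per-frame acoustic likelihoods, it recovers the most probable hidden (e.g., phoneme) sequence. All state names and probabilities are hypothetical placeholders.

```python
# Toy Viterbi decode over a two-state HMM (hypothetical numbers).
import numpy as np

states = ["phoneme_A", "phoneme_B"]
log_init = np.log([0.6, 0.4])               # initial state probabilities
log_trans = np.log([[0.7, 0.3],             # state-transition probabilities
                    [0.4, 0.6]])
# Per-frame acoustic likelihoods P(observation | state); three frames here.
log_obs = np.log([[0.9, 0.2],
                  [0.5, 0.5],
                  [0.1, 0.8]])

n_frames, n_states = log_obs.shape
delta = log_init + log_obs[0]               # best log-score ending in each state
backptr = np.zeros((n_frames, n_states), dtype=int)

for t in range(1, n_frames):
    scores = delta[:, None] + log_trans     # score of every (prev -> cur) move
    backptr[t] = np.argmax(scores, axis=0)  # best predecessor for each state
    delta = scores[backptr[t], range(n_states)] + log_obs[t]

# Trace back the most probable state sequence.
path = [int(np.argmax(delta))]
for t in range(n_frames - 1, 0, -1):
    path.append(backptr[t][path[-1]])
path.reverse()
print([states[s] for s in path])            # ['phoneme_A', 'phoneme_A', 'phoneme_B']
```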
Major challenges to speech recognition software and systems development progress
have historically been that (a) continuous speech (CS) is very much more difficult
to recognize than single isolated-word speech and (b) different speakers have
very different voice patterns from each other. The former is primarily because,
in continuous speech, we pronounce and enunciate words differently depending on
their context, our mood, our stress state, and the speed at which we speak. The
latter is due to physiological and anatomical differences, age, sex, regional
accent, and other factors. Furthermore, another major problem has been how to reproducibly get
the sound (natural speech) into the recognition system without loss or distortion
of the information it contains. It turns out that the type and positioning of
the microphone(s) or pickups used are critical. Head-mounted oral microphones,
and their exact positioning, have been particularly thorny problems despite
their superior frequency response. Some attempts to use ear pickup microphones
(see, e.g., Bergeron, supra) have shown fair results despite the known poorer
passage of high-frequency content through the bones of the skull. This result
speaks volumes about the positioning difficulties of mouth microphones,
which should otherwise give substantially superior performance given their
broader frequency content.
Recently, two companies, IBM and Dragon Systems, have offered commercial PC-based
software products (IBM ViaVoice™ and Dragon NaturallySpeaking™) that
can recognize continuous speech with fair accuracy after the user conducts carefully
designed mandatory training or "enrollment" sessions with the software. Even
with such enrollment, the accuracy is approximately 95% under controlled conditions
involving careful microphone placement and minimal or no background noise. If,
during use, there are other speakers in the room having separate conversations
(or there are reverberant echoes present), then numerous irritating recognition
errors can result. Likewise, if the user moves the vendor-recommended directional
or noise-canceling microphone away from, or too far from, directly in front of the
lips, or speaks too softly, then accuracy drops precipitously. It is
no wonder that speech recognition software is not yet significantly utilized
in mission-critical applications.
The inventors herein address the general lack of robustness described above
in a manner such that accuracy during speaking can be improved, training (enrollment)
can become a more robust, if not continuous, improvement process, and one may speak
softly and indeed even "mouth words" without significant audible sound generation,
yet retain recognition performance. Finally, the inventors have also devised
a means for nearby and/or conversing speakers using voice-recognition systems
to automatically have their systems adapted to purposefully avoid operational
interference with each other. This aspect has been of serious concern when trying
to insert voice recognition capabilities into a busy office area wherein numerous
interfering (overheard) conversations cannot easily be avoided.
The additional and more reproducible artificial excitations of the invention
may also be used to increase the acoustic uniqueness of utterances, thus speeding
up speech-recognition processing for a given recognition-accuracy requirement.
Such a speedup could, for example, be realized from the reduction in the number
of candidate utterances needing software-comparison. In fact, such reductions
in utterance identification possibilities also improve recognition accuracy
as there are fewer incorrect conclusions to be made.
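As a toy illustration of this pruning effect, suppose the artificial excitation yields one extra binary feature per utterance, say a hypothetical "nasal coupling detected" flag; candidates inconsistent with that flag can be discarded before any expensive acoustic comparison. The lexicon and feature below are invented for illustration.

```python
# Hypothetical sketch: an extra feature derived from the artificial excitation
# (a binary "nasal coupling" flag) prunes the candidate list before the
# expensive acoustic comparison.
candidates = {
    "man": {"nasal": True},
    "bad": {"nasal": False},
    "mad": {"nasal": True},
    "bat": {"nasal": False},
}

def prune(candidates, nasal_detected):
    """Keep only candidates whose expected nasal coupling matches the probe."""
    return [w for w, feats in candidates.items() if feats["nasal"] == nasal_detected]

# If the probe indicates nasal coupling, half this toy lexicon drops out,
# so the downstream search is both faster and less error-prone.
print(prune(candidates, nasal_detected=True))   # ['man', 'mad']
```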
Utterance or speech-recognition practiced using the invention may have any purpose
including, but not limited to: (1) talking to, commanding or conversing with
local or remote computers, computer-containing products, telephony products
or speech-conversant products (or with other persons using them); (2) talking
to or commanding a local or remote system that converts recognized speech or
commands to recorded or printed text or to programmed actions of any sort (e.g.,
voice-mail interactive menus, computer-game control systems); (3) talking to
another person(s), locally or remotely located, wherein one's recognized speech
is presented to the other party as text or as a synthesized voice (possibly
in his/her different language); (4) talking to or commanding any device (or
connected person) discreetly or in apparent silence; (5) user-identification
or validation wherein security is increased over prior-art speech fingerprinting
systems due to the additional information available in the speech signal, or
even the ability to manipulate artificial excitations without the user's awareness;
(6) allowing multiple equipped speakers to each have their own speech recognized
free of interference from the other audible speakers (regardless of their remote
locations or collocation); (7) adapting a user's "speech" output to obtain better
recognition-processing performance, as by adding individually-customized artificial
content for a given speaker and making that content portable, if not network-available.
(This could also eliminate or minimize retraining of new recognition systems
by new users.)
DISCLOSURE OF INVENTION
In accordance with the present invention, a means and method are disclosed for
enhancing or replacing the natural excitation of the human vocal tract by artificial
excitation means wherein the artificially created acoustics present additional
spectral, temporal or phase data useful for (1) enhancing the machine recognition
robustness of audible speech or (2) enabling more robust machine-recognition
of relatively inaudible mouthed or whispered speech. The artificial excitation
may be arranged to be audible or inaudible, may be designed to be non-interfering
with another user's similar means, may be used in one or both of a vocal content-enhancement
mode or a complementary vocal tract-probing mode, and may be used for the recognition
of audible or inaudible continuous speech or isolated spoken commands.
Specifically, an artificial acoustic excitation means is provided for acoustic
coupling into a functional vocal tract working in cooperation with a speech
recognition system, wherein the artificial excitation coupling characteristics
provide information useful to the identification of speech by the system.
The present invention extends the performance and applicability of speech-recognition
in the following ways: (1) Improves speech-recognition accuracy and/or speed
for audible speech; (2) Eliminates recognition-interference (accuracy degradation)
due to competing speakers or voices (e.g., as in a busy office with many independent
speakers); (3) Newly allows for voice-recognition of silent or mouthed/whispered
speech (e.g., for discreetly interfacing with speech-based products and devices);
and (4) Improves security for speech-based user-identification or user-validation.
In essence, the human vocal tract is artificially excited, directly or indirectly,
to produce sound excitations, which are articulated by the speaker. These sounds,
because they are artificially excited, have far more latitude than the familiar
naturally excited voiced and aspirated human sounds. For example, they may or
may not be audible, may excite natural vocal articulators (audibly or inaudibly)
and/or may excite new articulators (audibly or inaudibly).
Artificially excited "speech" output may be superimposed on normal speech to
increase the raw characteristic information content. Artificially excited output
may be relatively or completely inaudible, thus also allowing for good recognition
accuracy while whispering or even mouthing words. Artificial content may help discern
between competing speakers so equipped, whether they are talking to each other
or are in separate cubicles. Artificial content may also serve as a user voiceprint.
Systems taking advantage of this technology may be used for continuous speech
or command-style discrete speech. Such systems may be trained using one or both
of natural speech and artificial speech.
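One way to picture such superposition, purely as a simulation with hypothetical parameters, is to amplitude-modulate a near-inaudible high-frequency probe with a slowly varying articulation envelope, mix it with speech-band content, and recover the envelope at the receiver by band-pass filtering and demodulation:

```python
# Illustrative simulation (hypothetical parameters): a near-inaudible probe
# is modulated by time-varying articulation, superimposed on normal speech,
# then recovered by band-pass filtering and envelope demodulation.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 48000
t = np.arange(fs) / fs                      # one second

speech = 0.5 * np.sin(2 * np.pi * 300 * t)  # stand-in for speech-band content
articulation = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))      # slow articulator motion
probe = 0.05 * articulation * np.sin(2 * np.pi * 18000 * t)  # modulated probe

mixture = speech + probe                    # what the microphone picks up

# Receiver: isolate the probe band, then demodulate its envelope.
b, a = butter(4, [16000 / (fs / 2), 20000 / (fs / 2)], btype="band")
probe_band = filtfilt(b, a, mixture)
envelope = np.abs(hilbert(probe_band))      # tracks 'articulation'
```

In this sketch the recovered envelope tracks the articulators even when no audible speech energy is present, which suggests how recognition of mouthed or whispered speech could proceed.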
The artificial excitations may incorporate any of several features, including:
(a) broadband excitation; (b) narrowband excitation(s), such as a harmonic frequency
of a natural formant; (c) multiple tones wherein the tones phase-interact with
articulation (natural speech hearing does not significantly involve phase; see
the sketch following this list); (d) excitations which are delivered (or processed)
only as a function of the success of ongoing natural speech recognition; and
(e) excitations which are feedback-optimized for each speaker.
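The sketch below, with purely illustrative frequencies, generates an excitation combining features (b) and (c): one narrowband tone at a harmonic of a nominal formant, plus a closely spaced tone pair whose relative phase articulation would perturb.

```python
# Illustrative multi-tone excitation (frequencies are examples only):
# one tone at a harmonic of a nominal 700 Hz formant, plus two closely
# spaced tones whose relative phase the articulators would perturb.
import numpy as np

fs = 48000
t = np.arange(fs // 10) / fs                # 100 ms burst

f1_nominal = 700.0                          # nominal first-formant frequency
harmonic = 2 * f1_nominal                   # narrowband component, feature (b)
pair = (5000.0, 5050.0)                     # phase-interacting pair, feature (c)

excitation = (
    np.sin(2 * np.pi * harmonic * t)
    + 0.5 * np.sin(2 * np.pi * pair[0] * t)
    + 0.5 * np.sin(2 * np.pi * pair[1] * t + np.pi / 4)  # deliberate phase offset
)
excitation /= np.max(np.abs(excitation))    # normalize for the transducer
```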
The user need not be aware of the added acoustic information nor of its processing.
Consumer/business products incorporating the technology may include computers,
PCs, office-wide systems, PDAs, terminals, telephones, games, or any speech-conversant,
speech-controlled or sound-controlled appliance or product. With the discreet,
inaudible option, such products could be used in public with relative privacy.
Additional police, military and surveillance products are likely.
Other objects, features, and advantages of the present invention will become
apparent upon consideration of the following detailed description and accompanying
drawings, in which like reference designations represent like features throughout
the FIGURES.
INDUSTRIAL APPLICABILITY
The voice recognition scheme disclosed herein is expected to find use in a wide
variety of applications, including (a) provision of a robust speech interface
to computers, terminals, personal electronic products, games, security devices
and identification devices, (b) for non-interfering recognition with multiple
speakers or voices present, (c) for the automatic recognition of multiple speakers
and discerning them from each other, (d) for discreet or silent speaking or
command-giving speech recognition, and (e) for the option of having a portable,
user-customized artificial enhancement excitation usable with more than one
recognition system.
Thus, there has been disclosed a voice
recognition scheme involving signal injection coupling into the human vocal
tract for robust audible and inaudible voice recognition. It will be readily
apparent to those skilled in this art that various changes and modifications
of an obvious nature may be made, and all such changes and modifications are
considered to fall within the scope of the present invention, as defined by
the appended claims.