Patent No. 5729694
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves (Holzrichter, et al., Mar 17, 1998)
Abstract
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as combined voiced and unvoiced, speech. The methods include deconvolving the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.
Notes:
Government Interests
The United States Government has rights in this invention pursuant to Contract
No. W-7405-ENG-48 between the United States Department of Energy and the University
of California for the operation of Lawrence Livermore National Laboratory.
BACKGROUND OF THE INVENTION
The invention relates generally to the characterization of human speech using
combined EM wave information and acoustic information, for purposes of speech
coding, speech recognition, speech synthesis, speaker identification, and related
speech technologies.
Speech Characterization and Coding:
The history of speech characterization, coding, and generation has spanned the
last one and one half centuries. Early mechanical speech generators relied upon
using arrays of vibrating reeds and tubes of varying diameters and lengths to
make human-voice-like sounds. The combinations of excitation sources (e.g.,
reeds) and acoustic tracts (e.g., tubes) were played like organs at theaters
to mimic human voices. In the 20th century, the physical and mathematical descriptions
of the acoustics of speech began to be studied intensively and these were used
to enhance many commercial products such as those associated with telephony
and wireless communications. As a result, the coding of human speech into electrical
signals for the purposes of transmission was extensively developed, especially
in the United States at the Bell Telephone Laboratories. A complete description
of this early work is given by J. L. Flanagan, in "Speech Analysis, Synthesis,
and Perception", Academic Press, New York, 1965. He describes the physics of
speech and the mathematics of describing acoustic speech units (i.e., coding).
He gives examples of how human vocal excitation sources and the human vocal
tracts behave and interact with each other to produce human speech.
The commercial intent of the early telephone work was to understand how to use
the minimum bandwidth possible for transmitting acceptable vocal quality on
the then-limited number of telephone wires and on the limited frequency spectrum
available for radio (i.e., wireless) communication. Second, workers learned that analog voice transmission typically uses about 100 times more bandwidth than transmitting the same words as simple numerical codes representing speech units such as phonemes or words. This technology is called "Analysis-Synthesis Telephony" or "Vocoding". For example, sampling at 8 kHz and using 16 bits per analog signal value requires 128 kbps, but the Analysis-Synthesis approach can lower the coding requirements to below 1.0 kbps. In spite
of the bandwidth advantages, vocoding has not been used widely because it requires
accurate automated phoneme coding and resynthesis; otherwise the resulting speech
tends to have a "machine accent" and be of limited intelligibility. One major aspect of the difficulty of speech coding is the adequacy of the excitation information, including the pitch measurement, the voiced-unvoiced discrimination, and the spectrum of the glottal excitation pulse.
Progress in speech acoustical understanding and mathematical modeling of the
vocal tract has continued and become quite sophisticated, mostly in the laboratory.
It is now reasonably straightforward to simulate human speech by using differential
equations which describe the increasingly complex concatenations of sound excitation
sources, vocal tract tubes, and their constrictions and side branches (e.g.,
vocal resonators). Transform methods (e.g. electrical analogies solved by Fourier,
Laplace, Z-transforms, etc.) are used for simpler cases and sophisticated computational
modeling on supercomputers for increasingly complex and accurate simulations.
See Flanagan (ibid.) for early descriptions of modeling, and Schroeter and Sondhi, "A hybrid time-frequency domain articulatory speech synthesizer", IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-35(7), 1987, and "Techniques for Estimating Vocal-Tract Shapes from the Speech Signal", ASSP 2(1), 1343, 1994. These papers reemphasize
that it is not possible to work backwards from the acoustic output to obtain
a unique mathematical description of the combined vocal fold-vocal tract system,
which is called the "inverse problem" herein. It is not possible to obtain information
that separately describes both the "zeros" in speech air flow caused by glottal
(i.e., vocal fold) closure and those caused by closed, or resonant structures
in the vocal tract. As a result, it is not possible to use the well developed
mathematics of modern signal acquisition, processing, coding, and reconstructing
to the extent needed.
In addition, given a mathematical vocal system model, it remains especially
difficult to associate it with a unique individual because it is very difficult
to obtain the detailed physiological vocal tract features of a given individual
such as tract lengths, diameters, cross sectional shapes, wall compliance, sinus
size, glottal size and compliance, lung air pressure, and other necessary parameters.
In some cases, deconvolving the excitation source from the acoustic output can
be done for certain sounds where the "zeros" are known to be absent, so the
major resonant structures such as tract lengths can be determined. For example,
simple acoustic resonator techniques (see the 1976 U.S. Pat. No. 4,087,632 by
Hafer) are used to derive the tongue body position by measuring the acoustic
formant frequencies (i.e., the vocal tube resonance frequencies) and to constrain
the tongue locations and tube lengths against an early, well known vocal tract
model by Coker, "A Model of Articulatory Dynamics and Control", Proc. of IEEE,
Vol. 64(4), 452-460, 1976. The problem with this approach is that only gross
dimensions of the tract are obtained, but detailed vocal tract features are
needed to unambiguously define the physiology of the human doing the speaking.
For more physiological details, x-ray imaging of the vocal tract has been used to obtain tube lengths, diameters, and resonator areas and structures. An optical laryngoscope, inserted into the throat, is also used to view the vocal fold open and close cycles and to observe their sizes and time behavior.
The limit to further performance improvements in acoustic speech recognition,
in speech synthesis, in speaker identification, and other related technologies
is directly related to our inability to accurately solve the inverse problem.
Present workers are unable to use acoustic speech output to work backwards to
accurately and easily determine the vocal tract transfer function, as well as
the excitation amplitude versus time. The "missing" information about the separation
of the excitation function from the vocal tract transfer function leads to many
difficulties in automating the coding of the speech for each speech time frame
and in forming speech sound-unit libraries for speech-related technologies.
A major reason for the problem is that workers have been unable to measure the excitation function in real time. This has made it difficult to automatically identify the start and stop of each voiced speech segment over which a speech sound unit is constant. It has also made it difficult to join (or to separate) the transitions between sequential vocalized speech units (e.g., syllables, phonemes, or multiplets of phonemes) as an individual human speaker articulates sounds at rates of approximately 10 phonemes per second, or two words per second.
The lack of precision in speech segment identification adds to the difficulty
in obtaining accurate model coefficients for both the excitation function and
the vocal tract. Further, this leads to inefficiencies in the algorithms and
the computational procedures required by the technological application such
as speech recognition. In addition, the difficulties described above prevent
the accurate coding of the unique acoustic properties of a given individual
for personalized, human speech synthesis or for pleasing vocoding. In addition,
the "missing" information prevents complete separation of the excitation from
the transfer function, and limits accurate speaker-independent speech-unit coding
(speaker normalization). The incomplete normalization limits the ability to
conduct accurate and rapid speech recognition and/or speaker identification
using statistical codebook lookup techniques, because the variability of each
speaker's articulation adds uncertainty in the matching process and requires
additional statistical processing. The missing information and the timing difficulties
also inhibit the accurate handling of co-articulation, incomplete articulation,
and similar events where words are run together in the sequences of acoustic
units comprising a speech segment.
In the 1970s, workers in the field of speech recognition showed that short "frames" (e.g., 10 ms intervals) of the time waveform of a speech signal could be well approximated by an all-pole (no zeros) analytic representation, using numerical "linear predictive coding" (LPC) coefficients found by solving covariance equations. Specific procedures are described in B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave", J. Acoust. Soc. Am. 50(2), pp. 63, 1971. The LPC coefficients are a form of speech coding and have the advantage of characterizing acoustic speech with a relatively small number of variables (typically 20 to 30 per frame as implemented in today's systems). They make possible statistical table lookup of large numbers of word representations using Hidden Markov techniques for speech recognition.
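As an illustration of this kind of frame-by-frame analysis, the following sketch computes all-pole LPC coefficients for one short frame using the autocorrelation method and a Levinson-Durbin recursion (a common alternative to the covariance equations cited above); the frame length, model order, and synthetic test signal are illustrative assumptions, not values from the patent.

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """All-pole (LPC) coefficients for one short speech frame, estimated with
    the autocorrelation method and the Levinson-Durbin recursion."""
    x = frame * np.hamming(len(frame))                 # taper the frame edges
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lags 0..N-1
    a = [1.0]                                          # predictor polynomial A(z)
    err = r[0]                                         # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                                 # reflection (PARCOR) coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= (1.0 - k * k)
    return np.array(a), err                            # a[1..order] are the LPC terms

# Illustrative use on a synthetic 10 ms frame sampled at 8 kHz (80 samples).
fs = 8000
t = np.arange(80) / fs
frame = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
         + 0.01 * np.random.randn(80))
coeffs, residual_energy = lpc_coefficients(frame, order=12)
```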
In speech synthesizers, code books of acoustic coefficients (e.g., using well
known LPC, PARCOR, or similar coefficients) for each of the phonemes and for
a sufficient number of diphonemes (i.e. phoneme pairs) are constructed. Upon
demand from text-to-speech generators, they are retrieved and concatenated to
generate synthetic speech. However, as an accurate coding technique, they only
approximate the speech frames they represent. Their formation and use is not
based upon using knowledge of the excitation function, and as a result they
do not accurately describe the condition of the articulators. They are also
inadequate for reproducing the characteristics of the given human speaker. They
do not permit natural concatenation into high quality natural speech. They can
not be easily related to an articulatory speech model to obtain speaker-specific
physiological parameters. Their lack of association with the articulatory configuration
makes it difficult to do speaker normalization, as well as to deal with the
coarticulation and incomplete articulation problem of natural speech.
Present Example of Speech Coding:
Rabiner, in "Applications of Voice Processing to Telecommunications" Proc. of
the IEEE 82, 199 February 1994 points out that several modern text-to-speech
synthesis systems in use today by AT&T use 2000 to 4000 diphonemes, which
are needed to simulate the phoneme-tophoneme transitions in the concatenation
process for natural speech sounds. FIG. 1 shows a prior art open loop acoustic
speech coding system in which acoustic signals from a microphone are processed,
e.g. by LPC, and feature vectors are produced and stored in a library. Rabiner
also points out (page 213) that in current synthesis models, the vocal source
excitation and the vocal tract interaction "is grossly inadequate", and also
that "when natural duration and pitch are copied onto a text-to-speech utterance,
. . . the quality of the . . . synthetic speech improves dramatically." Presently,
it is not possible to economically capture the natural pitch duration and voiced
air-pulse amplitude vs. time, as well as individual vocal tract qualities, of
a given individual's voice in any of the presently used models, except by very
expensive and invasive laboratory measurements and computations.
J. L. Flanagan, "Technologies for Multimedia Communications", Proc. IEEE 82,
590, April 1994, describes low bandwidth speech coding: "At fewer than 1 bit
per Nyquist sample, source coding is needed to additionally take into account
the properties of the signal generator (such as voiced/unvoiced distinctions
in speech, and pitch, intensity, and formant characteristics)." There is presently no commercially useful method to account for the speech excitation source in order to minimize the coding complexity and subsequent bandwidth.
EM Sensors and Acoustic Information:
The use of EM sensors for measuring speech organ conditions for the purposes of speech recognition and related technologies is described in copending U.S. patent application Ser. No. 08/597,596 by Holzrichter. Although it has been
recognized for many decades in the field of speech recognition that speech organ
position and motion information could be useful, and EM sensors (e.g. rf and
microwave radars) were available to do the measurement, no one had suggested
a system using such sensors to detect the motions and locations of speech organs.
Nor had anyone described how to use this information to code each speech unit
and to use the code in an algorithm to identify the speech unit, or for other
speech technology applications such as synthesis. Holzrichter showed how to
use EM sensor information with simultaneously obtained acoustic data to obtain
the positions of vocal organs, how to define feature vectors from this organ
information to use as a coding technique, and how to use this information to
do high-accuracy speech recognition. He also pointed out that this information
provided a natural method of defining changes in each phoneme by measuring changes
in the vocal organ conditions, and he described a method to automatically define
each speech time frame. He also showed that "photographic quality" EM wave images,
obtained by tomographic or similar techniques, were not necessary for the implementation
of the procedures he described, nor for the procedures described herein.
SUMMARY OF THE INVENTION
Accordingly it is an object of the invention to provide method and apparatus
for speech coding using nonacoustic information in combination with acoustic
information.
It is also an object of the invention to provide method and apparatus for speech
coding using Electromagnetic (EM) wave generation and detection modules in combination
with acoustic information.
It is also an object of the invention to provide method and apparatus for speech
coding using radar in combination with acoustic information.
It is another object of the invention to use micropower impulse radar in conjunction
with acoustic information for speech coding.
It is another object of the invention to use the methods and apparatus provided
for speech coding for the purposes of speech recognition, mathematical approximation,
information storage, speech compression, speech synthesis, vocoding, speaker
identification, prosthesis, language teaching, speech correction, language identification,
and other speech related applications.
The invention is a method and apparatus for joining nonacoustic and acoustic data. Nonacoustic information describing the speech organs is obtained using electromagnetic (EM) waves such as RF waves, microwaves, millimeter waves, infrared, or optical waves at wavelengths that reach the speech organs for measurement. This information is combined with conventional acoustic information measured with a microphone, using a deconvolving algorithm, to produce more accurate speech coding than is obtainable using acoustic information alone. The coded information, representing the speech, is then available for speech technology applications such as speech compression, speech recognition, speaker recognition, speech synthesis, and speech telephony (i.e., vocoding).
Simultaneously obtained EM sensor and acoustic information are used to define
a time frame and to obtain the details of a human speaker's excitation function
and vocal tract function for each speech time frame. The methods make available
the formation of numerical feature vectors for characterizing the acoustic speech
unit spoken during each speech time frame. This makes possible a new method of speech
characterization (i.e., coding) using a more complete and accurate set of information
than has been available to previous workers. Such coding can be used for purposes
of more accurate and more economical speech recognition, speech compression,
speech synthesis, vocoding, speaker identification, teaching, prosthesis, and
other applications.
The present invention enables the user to obtain the transfer function of the
human speech system for each speech time frame defined using the methods herein.
In addition, the present invention includes several algorithmic methods of coding
(i.e., numerically describing) these functions for valuable applications in
speech recognition, speech synthesis, speaker identification, speech transmission,
and many other applications. The coding system, described herein, can make use
of much of the apparatus and data collection techniques described in the copending
patent application Ser. No. 08/547,596, including EM wave generation, transmission,
and detection, as well as data averaging, and data storage algorithms. The procedures
defined in the copending patent application are called NASR or Non Acoustic
Speech Recognition. Procedures based upon acoustic prior art are called CASR
for Conventional Acoustic Speech Recognition, and these procedures are also
used herein to provide processed acoustic information.
The following terms are used herein. An acoustic speech unit is the single or
multiple sound utterance that is being described, recognized, or synthesized
using the methods herein. Examples include syllables, demi-syllables, phonemes,
phone-like speech units (i.e., PLUs), diphones, triphones, and more complex
sound sequences such as words. Phoneme acoustic-speech-units are used for most
of the speech unit examples herein. A speech frame is a time during which speech
organ conditions (including repetitive motions of the vocal folds) and the acoustic
output remain constant within pre-defined values that define the constancy.
Multiple time frames are a sequence of time frames joined together in order
to describe changes in acoustic or speech organ conditions as time progresses.
A speech period, or pitch period, is the time the glottis is open plus the time it is closed until the next glottal cycle begins, which includes transitions to unvoiced speech or to silence. A speech segment is a period of time of sounded
speech that is being processed using the methods herein. Glottal tissue includes
vocal fold tissue and surrounding tissue, and glottal open/close cycles are
the same as vocal fold open/close cycles. The word functional, as used herein, means a mathematical function with both variables and symbolic parameter-coefficients, whereas the word function means a functional with defined numerical parameter-coefficients.
The present methods and apparatus work for all human speech sounds and languages,
as well as for animal sounds generated by vocal organ motions detectable by
EM sensors and processed as described. The examples are based on, but not limited to, American English speech.
1) EM Sensor Generator:
All configurations of EM wave generation and detection modules that meet the
requirements for frequency, timing, pulse format, tissue transmission, and power
(and safety) can be used. EM wave generators may be used which, when related
to the distance from the antenna(s), operate in the EM near-field mode (mostly
non-radiating), in the intermediate-EM-field mode where the EM wave is both
non-radiating and radiating, and in the radiating far-field mode (i.e. most
radars). EM waves in several wavelength bands, from below 10^8 Hz to above 10^14 Hz, can penetrate tissue and be used as described herein. A particular example
is a wide-band microwave EM generator impulse radar, radiating 2.5 GHz signals
and repeating its measurement at a 2 MHz pulse repetition rate, which penetrates
over 10 cm into the head or neck. Such units have been used with appropriate
algorithms to validate the methods. These units have been shown to be economical
and safe for routine human use. The speech coding experiments have been conducted
using EM wave transmit/receive units (i.e., impulse radars) in two different
configurations. In one configuration, glottal open-close information, together
with simultaneous acoustic speech information, was obtained using one microphone
and one radar unit. In a second set of experiments, three EM sensor units and
one acoustic unit were used. In addition, a particular method is described for
improving the accuracy of transmitting and receiving an electromagnetic wave
into the head and neck, for very high accuracy excitation function descriptions.
2) EM Sensor Detector:
Many different EM wave detector modes have been demonstrated for the purpose
of obtaining nonacoustic speech organ information. A multiple pulse, fixed-range-gate
reception system (i.e., field disturbance mode) has been used for vocal fold
motion and nearby tissue motion detection. Other techniques have been used to
determine the positions of other vocal organs to obtain added information on
the condition of the vocal tract. Many other systems are described in the radar
literature on EM wave detection, and can be employed.
3) Configuration Structures and Control System:
Many different control techniques for portable and fixed EM sensor/acoustic
systems can be used for the purposes of speech coding. However, the processing
procedures described herein may require additional and different configurations
and control systems. For example, in applications such as high fidelity, "personalized"
speech synthesis, extra emphasis must be placed on the quality of the instrumentation,
the data collection, and the sound unit parsing. The recording environments,
the instrumentation linearity, the dynamic range, the relative timing of the
sensors (e.g. acoustic propagation time from the glottis to the microphone),
the A/D converter accuracy, the processing algorithms' speed and accuracy, and
the qualities of play back instrumentation are all very important.
4) Processing Units and Algorithms:
For each set of received EM signals and acoustic signals there is a need to
process and extract the information on organ positions (or motions) and to use
the coded speech sounds for the purposes of deconvolving the excitation from
the acoustic output, and for tract configuration identification. For example,
information on the positions of the vocal folds (and therefore the open area
for air flow) vs. time is obtained by measuring the reflected EM waves as a
function of time. Similarly, information on the conditions of the lips, jaw,
teeth, tongue, and velum positions can be obtained by transmitting EM waves
from other directions and using other pulse formats. The reflected and received
signals from the speech organs are stored in a memory and processed every speech
time frame, as defined below. The reflected EM signals can be digitized, averaged,
and normalized, as a function of time, and feature vectors can be formed.
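A minimal sketch of this reduction step is given below; it averages the repeated EM returns collected during one speech time frame, normalizes them, and appends a coarse acoustic spectral description to form a single feature vector. The array shapes, coefficient counts, and normalization are assumed for illustration and are not the specific procedures claimed.

```python
import numpy as np

def em_acoustic_feature_vector(em_pulses, acoustic_frame, n_em_coeffs=16, n_fft=64):
    """Form one feature vector for a speech time frame from EM sensor returns
    (one row per received pulse) and the simultaneous acoustic samples."""
    em_avg = em_pulses.mean(axis=0)                       # average the repeated EM returns
    em_avg = em_avg / (np.max(np.abs(em_avg)) + 1e-12)    # normalize the averaged return
    em_coeffs = np.interp(np.linspace(0, len(em_avg) - 1, n_em_coeffs),
                          np.arange(len(em_avg)), em_avg)  # resample to a fixed length

    spectrum = np.abs(np.fft.rfft(acoustic_frame, n_fft))  # coarse acoustic spectrum
    spectrum = spectrum / (np.max(spectrum) + 1e-12)

    return np.concatenate([em_coeffs, spectrum])          # joined EM/acoustic feature vector
```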
The present invention uses EM sensor data to automatically define a speech time
frame using the number of times that the glottis opens and closes for vocalized
speech, while the conditions of other speech organs and the acoustics remain
substantially constant. The actual speech time frame interval used for the processing
(for either coding or reconstructing) can be adapted to optimize the data processing.
The interval can be described by one or several constant single pitch periods,
by a single pitch period value and a multiplier describing the number of substantially
identical periods over which little sound change occurs, or it can use the pitch
periods to describe a time interval of essentially constant speech but with
"slowly changing" organ or acoustic conditions. The basic glottal-period timing-unit
serves as a master timing clock. The use of glottal periods for master timing
makes possible an automated speech and vocal organ information processing system
for coding spoken speech, for speech compression, for speaker identification,
for obtaining training data, for codebook or library generation, for synchronization
with other instruments, and for other applications. This method of speech frame
definition is especially useful for defining diphones and higher order multiple
sound acoustic speech units, for time compression and alignment, for speaker
speech rate normalization, and for prosody parameter definition and implementation.
Timing can also be defined for unvoiced speech, similarly to the procedures
used for vocalized speech.
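The sketch below illustrates this use of the glottal cycle as a master timing clock: glottal cycle boundaries are located in the EM sensor waveform, and consecutive pitch periods are grouped into one speech time frame for as long as a simple measure of the acoustics stays nearly constant. The zero-crossing detector and the energy-change tolerance are illustrative stand-ins for the detection and constancy criteria described above.

```python
import numpy as np

def glottal_cycle_boundaries(em_signal):
    """Locate glottal cycle starts in the EM sensor waveform as positive-going
    zero crossings (a simplified stand-in for open/close detection)."""
    s = em_signal - np.mean(em_signal)
    return np.where((s[:-1] < 0) & (s[1:] >= 0))[0]       # sample indices of cycle starts

def frames_from_pitch_periods(boundaries, acoustic, change_tol=0.25):
    """Group consecutive pitch periods into speech time frames while the
    per-period acoustic energy stays within change_tol of the frame's start."""
    frames, start, ref = [], 0, None
    for i in range(len(boundaries) - 1):
        seg = acoustic[boundaries[i]:boundaries[i + 1]]
        energy = float(np.mean(seg ** 2))
        if ref is None:
            ref = energy
        elif abs(energy - ref) > change_tol * ref:
            frames.append((boundaries[start], boundaries[i]))   # close the current frame
            start, ref = i, energy
    frames.append((boundaries[start], boundaries[-1]))
    return frames                                          # list of (start, end) sample pairs
```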
Once a speech time frame is defined, the user deconvolves the acoustic excitation
function from the acoustic output function. Both are simultaneously measured
over the defined time frame. Because the mathematical problems of "invertibility" are overcome, much more accurate and efficient coding occurs compared to previous
methods. By measuring the human excitation source function in real time, including
the time during which the vocal folds are closed and the airflow stops (i.e.,
the glottal "zeros"), accurate approximations of these very important functional
shapes can be employed to model each speech unit. As a result of this new capability
to measure the excitation function, the user can employ very accurate, efficient
digital signal processing techniques to deconvolve the excitation function from
the acoustic speech output function. For the first time, the user is able to
accurately and completely describe the human vocal tract transfer function for
each speech unit.
There are three speech functions that describe human speech: E(t)=excitation
function, H(t)=transfer function, and I(t)=output acoustics function. The user
can determine any one of these three speech functions by knowing the two other
functions. The human vocal system operates by generating an excitation function,
E(t), which produces rapidly pulsating air flow (or air pressure pulses) vs.
time. These (acoustic) pulses are convolved with (or filtered by) the vocal
tract transfer function, H(t), to obtain a sound output, I(t). Being able to
measure, conveniently in real time, the input excitation E and the output I,
makes it possible to use linear mathematical processing techniques to deconvolve
E from I. This procedure allows the user to obtain an accurate numerical description
of the speaker's transfer function H. This method conveniently leads to a numerical
Fourier transform of the function H, which is represented as a complex amplitude
vs. frequency. A time domain function is also obtainable. These numerical functions
for H can be associated with model functions, or can be stored in tabular form,
in several ways. The function H is especially useful because it describes, in
detail, each speaker's vocal tract acoustical system and it plays a dominant
role in defining the individualized speech sounds being spoken.
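A minimal sketch of this deconvolution step follows, assuming a discrete frequency-domain division with a small regularization term to keep the division stable where the excitation spectrum is weak (the regularization is an added assumption, not part of the disclosure).

```python
import numpy as np

def vocal_tract_transfer_function(excitation, acoustic_output, n_fft=1024, eps=1e-3):
    """Estimate H for one voiced speech time frame by deconvolving the measured
    excitation E(t) from the measured acoustic output I(t): H(f) ~ I(f)/E(f)."""
    E = np.fft.rfft(excitation, n_fft)
    I = np.fft.rfft(acoustic_output, n_fft)
    H = I * np.conj(E) / (np.abs(E) ** 2 + eps)   # regularized spectral division
    h_t = np.fft.irfft(H, n_fft)                  # time-domain form of the transfer function
    return H, h_t                                 # complex amplitude vs. frequency, and h(t)
```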
Secondly, a synthesized output acoustic function, I(t), can be produced by convolving
the voiced excitation function, E(t), with the transfer function, H(t), for
each desired acoustic speech unit. Thirdly, the excitation function, E, can
be determined by deconvolving a previously obtained transfer function, H, from
a measured acoustic output function, I. This third method is useful to obtain
the modified-white-noise excitation-source spectra to define an excitation function
for each type of unvoiced excitation. In addition, these methods can make use
of partial knowledge of the functional forms E, H, or I for purposes of increasing
the accuracy or speed of operation of the processing steps. For example, the
transfer function H is known to contain a term R which describes the lips-to-listener
free space acoustic radiation transfer function. This function R can be removed
from H leaving a simpler function, H*, which is easier to normalize. Similar
knowledge, based on known acoustic physics, and known physiological and mechanical
properties of the vocal organs, can be used to constrain or assist in the coding
and in specific applications.
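For the second use named above, a correspondingly simple sketch resynthesizes the acoustic output of a frame by convolving an excitation with the stored time-domain transfer function; trimming the result to the frame length is an illustrative choice.

```python
import numpy as np

def synthesize_frame(excitation, h_t):
    """Resynthesize the acoustic output I(t) for one speech time frame by
    convolving the excitation E(t) with the time-domain transfer function h(t)."""
    return np.convolve(excitation, h_t)[:len(excitation)]
```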
The Bases of the Methods:
1) The vocalized excitation function of a speaker and the acoustic output from
the speaker are accurately and simultaneously measured using an EM sensor and
a microphone. As one important consequence, the natural opening and closing
of a speaker's glottis can serve as a master timing clock for the definition
of speech time frames.
2) The data from 1) is used to deconvolve the excitation function from the acoustic
output and to obtain the speaker's vocal tract transfer function for each speech time frame.
3) Once the excitation function, the transfer function, and the acoustic function
parameters are determined, the user forms feature vectors that characterize
the speech in each time frame of interest to the degree desired.
4) The formation procedures for the feature vectors are valuable and make possible
new procedures for more accurate, efficient, and economical speech coding, speech
compression, speech recognition, speech synthesis, telephony, speaker identification,
and other related applications.
Models and Coding of Human Speech:
It is common practice in acoustic speech technology as well as in many linear
system applications to use mathematical models of the system. Such models are
used because it is inefficient to retain all of the information measured in
a time-evolving (e.g., acoustic) signal, and because they provide a defining
constraint (e.g., a pattern or functional form) for simplifying or imposing
physical knowledge on the measured data. Users want to employ methods to retain
just enough information to meet the needs of their application and to be compatible
with the limitations of their processing electronics and software. Models fall into two general categories: linear and non-linear. The methods herein describe a large number of linear models for processing both the EM sensor and the acoustic information for purposes of speech coding, models that have not been available to previous practitioners of speech technology. The methods also include coding using nonlinear models of speech that are quantifiable by table lookup, by curve fitting, by perturbation methods, or by more sophisticated techniques relating an output signal to an input signal; these, too, have not been available to users.
The simultaneously obtained acoustic information can also be processed using
well known standard acoustic processing techniques. Procedures for forming feature
vectors using the processed acoustic information are well known. The resulting
feature vector coefficients can be joined with the feature vector coefficients generated by the EM sensor/acoustic methods described herein.
Vocal system models are generally described by an excitation source which drives
an acoustic resonator tract, from whence the sound pressure wave radiates to
a listener or to a microphone. There are two major types of speech: 1) voiced
where the vocal folds open and close rapidly, at approximately 70 to 200 Hz,
providing periodic bursts of air into the vocal tract, and 2) "unvoiced" excitations
where constrictions in the vocal tract cause air turbulence and associated modified-white
acoustic-noise. (A few sounds are made by both processes at the same time).
The human vocal tract is a complex acoustic-mechanical filter that transforms
the excitation (i.e., noise source or air pressure pulses) into recognizable
sounds, through mostly linear processes. Physically the human acoustic tract
is a series of tubes of different lengths, different area shapes, with side
branch resonator structures, nasal passage connections, and both mid and end
point constrictions. As the excitation pressure wave proceeds from the excitation
source to the mouth (and/or nose), it is constantly being transmitted and reflected
by changes in the tract structure, and the output wave that reaches the lips
(and nose) is strongly modified by the filtering processes. In addition, the
pressure pulses cause the surrounding tissue to vibrate at low levels which
affects the sound as well. It is also known that a backward propagating wave (i.e., a wave reflected off vocal tract transitions) travels backward toward the vocal folds and the lungs. It is not heard acoustically, but it can influence the glottal system and it does cause vocal tract tissue to vibrate. Such vibrations can be measured by an EM sensor used in a microphone mode.
Researchers at Bell Laboratories (Flanagan, Olive, Sondhi and Schroeter ibid.)
and elsewhere have shown that accurate knowledge of the excitation source characteristics
and the associated vocal tract configurations can uniquely characterize a given
acoustic speech unit such as a syllable, phoneme, or more complex unit. This
knowledge can be conveyed by a relatively small set of numbers, which serve
as the coefficients of feature vectors that describe the speech unit over each
speech time frame. They can be generated to meet the degree of accuracy demanded
by the applications. It is also known that if a change in a speech sound occurs,
the speaker has moved one or more speech organs to produce the changed sound.
The methods described herein can be used to detect such changes, to define a
new speech time frame, and to form a new feature vector to describe the new
speech conditions.
The methods for obtaining accurate vocal tract transfer function information can be used to define the coefficients of the feature vector that describes the totality of speech tract information for each time frame.
One type of linear model often used to describe the vocal tract transfer function
is an acoustic-tube model (see Sondhi and Schroeter, ibid). A user divides up
the human vocal tract into a large number of tract segments (e.g., 20) and then,
using advanced numerical techniques, the user propagates (numerically) sound
waves from an excitation source to the last tract segment (i.e., the output)
and obtains an output sound. The computer keeps track of all the reflections,
re-reflections, transmissions, resonances, and other propagation features. Experts
find the sound to be acceptable, once all of the parameters defining all the
segments plus all the excitation parameters are obtained.
While this acoustic tube model has been known for many years, the parameters
describing it have been difficult to measure, and essentially impossible to
obtain in real time from a given speaker. The methods herein, which describe the measuring of the excitation function and the acoustic output and the deconvolving procedures, yield a sufficient number of the needed parameters that the constrictions and conditions of the physical vocal tract structure model can be described for each time frame. One-dimensional numerical procedures, based upon time-series techniques,
have been experimentally demonstrated on systems with up to 20 tract segments
to produce accurate models for coding and synthesis.
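A much-simplified sketch of such a concatenated-tube (scattering-junction) model follows; it uses reflection coefficients derived from adjacent segment areas, one sample of delay per section, and idealized glottal and lip boundary conditions (no losses, no end reflections). The section areas and the pulse-train excitation are illustrative placeholders, not measured vocal tract data.

```python
import numpy as np

def junction_reflection_coefficients(areas):
    """Reflection coefficients at the junctions of successive tube sections,
    for pressure-like wave variables (acoustic impedance ~ 1/area)."""
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def tube_model_output(excitation, areas):
    """Propagate an excitation through lossless tube sections, tracking the
    forward and backward waves and their scattering at each junction."""
    k = junction_reflection_coefficients(areas)
    n_sec = len(areas)
    fwd = np.zeros(n_sec)                     # forward wave in each section
    bwd = np.zeros(n_sec)                     # backward wave in each section
    out = np.zeros(len(excitation))
    for n, x in enumerate(excitation):
        new_fwd, new_bwd = np.zeros(n_sec), np.zeros(n_sec)
        new_fwd[0] = x                        # glottal end: inject excitation, ignore reflection
        for j in range(n_sec - 1):            # scattering at each junction
            new_fwd[j + 1] = (1 + k[j]) * fwd[j] - k[j] * bwd[j + 1]
            new_bwd[j] = k[j] * fwd[j] + (1 - k[j]) * bwd[j + 1]
        out[n] = fwd[n_sec - 1]               # lip end: radiate, ignore lip reflection
        fwd, bwd = new_fwd, new_bwd
    return out

# Illustrative use with placeholder section areas (cm^2) and a crude glottal pulse train.
areas = [2.6, 2.6, 1.8, 1.0, 1.3, 2.0, 2.6, 3.0]
excitation = np.zeros(800)
excitation[::100] = 1.0
sound = tube_model_output(excitation, areas)
```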
A second type of linear acoustic model for the vocal tract is based upon electrical
circuit analogies where excitation sources and transfer functions (with poles
and zeros) are commonly used. The corresponding circuit values can be obtained
using measured excitation function, output function, and derived transfer-function
values. Such circuit analog models range from single mesh circuit analogies,
to 20 (or more) mesh circuit models. By defining the model with current representing
volume-air-flow (and voltage representing air pressure), then using capacitors
to represent acoustic tract-section chamber-volumes, inductors to represent
acoustic tract-section air-masses, and resistors to represent acoustic tract-section
air-friction and heat loss values, the user is able to model a vocal tract using
electrical system techniques. Circuit structures (such as T's and/or Pi's) correspond
to the separate structures of the acoustic system, such as tube lengths, tongue
positions, and side resonators of a particular individual. In principle, the
user chooses the circuit constants and structures to meet the complexity requirements
and forms a functional, with unknown parameter values. In practice it has been
easy to define circuit analogs, but very difficult to obtain the values describing
a given individual and even more difficult to measure them in real time. Using
a one mesh model, an electrical analog method has been experimentally validated
for obtaining the information needed to determine the feature vector coefficients
of a human in real time.
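A sketch of the simplest such analog, a single series mesh, is shown below: voltage stands for air pressure, current for volume air flow, and R, L, and C stand for air friction and heat loss, air mass, and chamber volume. The component values are placeholders chosen only to place the mesh's resonance in the speech band; they are not values measured from a speaker.

```python
import numpy as np

def single_mesh_admittance(freqs_hz, R=1.0, L=0.01, C=1e-6):
    """Frequency response (volume flow per unit driving pressure) of a single
    series R-L-C mesh used as an electrical analog of one vocal tract section."""
    w = 2 * np.pi * np.asarray(freqs_hz, dtype=float)
    z = R + 1j * w * L + 1.0 / (1j * w * C)        # series-mesh impedance
    return 1.0 / z                                  # admittance = "air flow" / "pressure"

freqs = np.linspace(50.0, 4000.0, 500)
Y = single_mesh_admittance(freqs)
resonance_hz = freqs[np.argmax(np.abs(Y))]          # peak marks the formant-like resonance
```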
A third important model is based upon time-series procedures (a type of digital signal processing) using autoregressive moving-average (ARMA) techniques. This
approach is especially valuable because it characterizes the behavior of a wave
as it traverses a series of transitions in the propagating media. The degree
of the ARMA functional reflects the number of transitions (i.e., constrictions
and other changes) in acoustic tracts used in the model of the individual. Such
a model is also very valuable because it allows the incorporation of several
types of excitation sources, the reaction of the propagating waves on the vocal
tract tissue media itself, and the feedback by backward propagating wave to
the excitation functions. The use of ARMA models has been validated using 14
zeros and 10 poles to form the feature vector for the vocal tract transfer function
of a speaker saying the phoneme /ah/ as well as other sounds.
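The sketch below fits such an ARMA (pole-zero) relation between a measured excitation and the acoustic output of one frame by ordinary least squares on the difference equation; the 10-pole, 14-zero orders follow the example in the text, while the plain least-squares estimator is an illustrative choice.

```python
import numpy as np

def fit_arma(excitation, output, n_poles=10, n_zeros=14):
    """Least-squares fit of I[n] = -sum_k a_k I[n-k] + sum_m b_m E[n-m],
    relating the acoustic output I to the measured excitation E."""
    E = np.asarray(excitation, dtype=float)
    I = np.asarray(output, dtype=float)
    start = max(n_poles, n_zeros)
    rows, targets = [], []
    for n in range(start, len(I)):
        past_out = [-I[n - j] for j in range(1, n_poles + 1)]   # autoregressive terms
        past_in = [E[n - m] for m in range(0, n_zeros + 1)]     # moving-average terms
        rows.append(past_out + past_in)
        targets.append(I[n])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    a = theta[:n_poles]              # denominator (pole) coefficients a_1..a_10
    b = theta[n_poles:]              # numerator (zero) coefficients b_0..b_14
    return a, b                      # these become feature-vector entries for the frame
```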
A fourth method is to use generalized curve fitting procedures to fit data in
tables of the measured excitation-function and acoustic-output processed values.
The process of curve fitting (e.g., using polynomials, LPC procedures, or other
numerical approximations) is to use functional forms that are computationally
well known and that use a limited number of parameters to produce an acceptable
fit to the processed numerical data. Sometimes the functional forms include
partial physical knowledge. These procedures can be used to measure and quantify
arbitrary linear as well as non-linear properties relating the output to the
input.
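As one minimal example of this fourth approach, a low-order polynomial can be fit to tabulated transfer-function magnitudes; the polynomial form and degree are illustrative choices, and other well-behaved functional forms could be substituted.

```python
import numpy as np

def fit_transfer_magnitude(freqs_hz, h_magnitude, degree=8):
    """Fit a polynomial in frequency to tabulated |H(f)| values; the fitted
    coefficients form a compact, approximate code for the measured data."""
    coeffs = np.polyfit(freqs_hz, h_magnitude, degree)
    fitted = np.polyval(coeffs, freqs_hz)
    return coeffs, fitted
```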
5) Speech Coding System and Post Processing Units:
The following devices can be used as part of a speech coding system or all together
for a variety of user chosen speech related applications. All of the following
devices, except generic peripherals, are specifically designed to make use of
the present methods and will not operate at full capability without these methods.
a) Telephone receiver/transmitter unit with EM sensors: A unit, chosen for the
application, contains the needed EM sensors, microphone, speaker, and controls
for the application at hand. The internal components of such a telephone-like
unit can include one or more EM sensors, a processing unit, a control unit,
a synthesis unit, and a wireless transmission unit. This unit can be connected
to a more complex system using wireless or transmission line techniques.
b) Control Unit: A specific device that carries out the control intentions of the user by directing the specific processors to work in a defined way. It directs the information to the specified processors, stores the processed data as directed in short- or long-term memory, and can transmit the data to another specified device for special processing, to display units, or to communications devices as directed.
c) Speech Coding Unit: A specific type of coding processor that joins information from an acoustic sensor with vocal organ information from the EM sensor system (e.g., from vocal fold motions) to generate a series of coefficients that are formed into a feature vector for each speech time frame. The algorithms to accomplish these actions are contained therein.
d) Speech Recognizer: Post processing units are used to identify the feature
vectors formed by the speech coding unit for speech recognition applications.
The speech recognition unit matches the feature vector from c) with those in
a pre-constructed library. The other post-processing units associated with recognition
(e.g., spell checkers, grammar checkers, and syntax checkers) are commonly needed
for the speech coding applications.
e) Speech Synthesizer and Speaker: Coded speech can be synthesized into audio
acoustic output. Information, thus coded, can be retrieved from the user's recent
speech, from symbolic information (e.g., ASCII symbol codes) that is converted
into acoustic output, from information transmitted from other systems, and from
system communications with users. Furthermore, the coded speech can be altered
and synthesized into many voices or languages.
f) Speaker Identification: As part of the post processing, the idiosyncratic
speech and organ motion characteristics of each speaker can be analyzed and
compared in real time. The comparison is to known records of the speaker's physical
speech organ motions, shapes, and language usage properties for a sequence of
words. The EM sensor information adds a new dimension of sophistication in the
identification process that is not possible using acoustic speech alone.
g) Encryption Units: Speech coded by the procedures herein can be further coded (i.e., encrypted) in various ways to make it difficult to use by anyone other than an authorized user. The methods described herein allow the user to code speech with such a low bandwidth requirement that encryption information can be added to the transmitted speech signal without requiring additional bandwidth beyond what is normally used.
h) Display Units: Computer-rendered speech information must be made available to the user for a variety of applications. A video terminal is used to show the written-word rendition of the spoken words and graphical renditions of the information (e.g., the articulators in a vocal tract), and a speaker is used to play previously recorded and coded speech to the user. The information can also be printed using printers or fax machines.
i) Hand Control Units: Hand control units can assist in the instruction of the
system being spoken to. The advantage of a hand control unit (similar to a "mouse")
is that it can assist in communicating or correcting the type of speech being
inputted. Examples are to distinguish control instructions from data inputting, to assist in editing by directing a combined speech- and hand-directed cursor to increase the speed of identifying displayed text segments, to increase the certainty of control by the user, to elicit playback of desired synthesized phrases, to request vocal tract pictures of the speaker's articulator positions for language correction, etc.
j) Language Recognizer and Translator Unit: As the speaker begins to talk into
a microphone, this device codes the speech and characterizes the measured series
of phonemes as to the language to which they belong. The system can request
the user to pronounce known words which are identified, or the system can use
statistics of frequent word sound patterns to conduct a statistical search through
the codebooks for each language.
It is also convenient to use this same unit, and the procedures described herein,
to accept speech recognized words from one language and to translate the symbols
for the same words into the speech synthesis codes for the second language.
The user may implement control commands requesting the speaker to identify the languages to be used. Alternatively, the automatic language identification unit can use the statistics of the language to identify the languages from which and to which the translations are to take place. The translator then performs
the translation to the second desired language, by using the speech unit codes,
and associated speech unit symbols, that the system generates while the first
language is spoken. The speech codes, generated by the translator, are then
converted into symbols or into synthesized speech in the desired second language.
k) Peripheral Units: Many peripheral units can be attached to the system as
needed by the user, making possible new capabilities. As an example, an auxiliary
instrument interface unit allows the connection of instruments, such as a video
camera, that require synchronization with the acoustic speech and speech coding.
A communications link is very useful because it provides wireless or transmission
line interfacing and communication with other systems. A keyboard is used to
interface with the system in a conventional way, but also to direct speech technology
procedures. Storage units such as disks, tape drives, and semiconductor memories are used to hold processed results or, during processing, for temporary storage of needed information.
The invention also includes telephone coding using identification procedures where the speech recognition results in a word identification. The word character computer code (e.g., ASCII) is transmitted along with no or minimal speaker voice characterization information for the purpose of minimizing the bandwidth of transmission. Word (i.e., language symbols such as letters, pictograms, and other symbols) transmission is known to be about 100-fold less demanding of transmission bandwidth than present speech telephony; thus the value of this transmission is very high.
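A rough estimate behind that figure is sketched below; the word rate, word length, and coded-speech rate are assumptions chosen only for illustration.

```python
# Rough bandwidth comparison; all rates below are illustrative assumptions.
words_per_second = 2            # speaking rate of roughly two words per second, as noted earlier
characters_per_word = 6         # assumed average word length plus a delimiter
bits_per_character = 8          # e.g., one ASCII byte
symbol_rate_bps = words_per_second * characters_per_word * bits_per_character   # ~96 bps
coded_speech_rate_bps = 8000    # an assumed low-rate coded-speech channel
ratio = coded_speech_rate_bps / symbol_rate_bps                                  # on the order of 100
```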
The methods include communication feedback to a user for many applications because
the physiological as well as acoustic information is accurately coded and available
for display or feedback. For speech correction or for foreign language learning,
displays of the vocal organs show organ mispositioning by the speaker. For deaf speakers, misarticulated sounds are identified and fed back using visual, tactile, or electrical stimulus units.
Changes and modifications in the specifically
described embodiments can be carried out without departing from the scope of
the invention which is intended to be limited only by the scope of the appended
claims.