Patent No. 5729694
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves (Holzrichter, et al., Mar 17, 1998)
Abstract
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as combined voiced and unvoiced, speech. The methods include deconvolving the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.
Notes:
Government Interests
The United States Government has rights in this invention pursuant to Contract
No. W-7405-ENG-48 between the United States Department of Energy and the University
of California for the operation of Lawrence Livermore National Laboratory.
BACKGROUND OF THE INVENTION
The invention relates generally to the characterization of human speech using
combined EM wave information and acoustic information, for purposes of speech
coding, speech recognition, speech synthesis, speaker identification, and related
speech technologies.
Speech Characterization and Coding:
The history of speech characterization, coding, and generation has spanned the
last one and one half centuries. Early mechanical speech generators relied upon
using arrays of vibrating reeds and tubes of varying diameters and lengths to
make human-voice-like sounds. The combinations of excitation sources (e.g.,
reeds) and acoustic tracts (e.g., tubes) were played like organs at theaters
to mimic human voices. In the 20th century, the physical and mathematical descriptions
of the acoustics of speech began to be studied intensively and these were used
to enhance many commercial products such as those associated with telephony
and wireless communications. As a result, the coding of human speech into electrical
signals for the purposes of transmission was extensively developed, especially
in the United States at the Bell Telephone Laboratories. A complete description
of this early work is given by J. L. Flanagan, in "Speech Analysis, Synthesis,
and Perception", Academic Press, New York, 1965. He describes the physics of
speech and the mathematics of describing acoustic speech units (i.e., coding).
He gives examples of how human vocal excitation sources and the human vocal
tracts behave and interact with each other to produce human speech.
The commercial intent of the early telephone work was to understand how to use
the minimum bandwidth possible for transmitting acceptable vocal quality on
the then-limited number of telephone wires and on the limited frequency spectrum
available for radio (i.e., wireless) communication. Second, workers learned that analog voice transmission typically uses about 100 times more bandwidth than transmitting the same words as simple numerical codes representing speech units such as phonemes or words. This technology is called "Analysis-Synthesis Telephony" or "Vocoding". For example, sampling at 8 kHz and using 16 bits per analog signal value requires 128 kbps, but the Analysis-Synthesis approach can lower the coding requirements to below 1.0 kbps. In spite
of the bandwidth advantages, vocoding has not been used widely because it requires
accurate automated phoneme coding and resynthesis; otherwise the resulting speech
tends to have a "machine accent" and be of limited intelligibility. One major aspect of the difficulty of speech coding is the adequacy of the excitation information, including the pitch measurement, the voiced-unvoiced discrimination, and the spectrum of the glottal excitation pulse.
Progress in speech acoustical understanding and mathematical modeling of the
vocal tract has continued and become quite sophisticated, mostly in the laboratory.
It is now reasonably straightforward to simulate human speech by using differential
equations which describe the increasingly complex concatenations of sound excitation
sources, vocal tract tubes, and their constrictions and side branches (e.g.,
vocal resonators). Transform methods (e.g. electrical analogies solved by Fourier,
Laplace, Z-transforms, etc.) are used for simpler cases and sophisticated computational
modeling on supercomputers for increasingly complex and accurate simulations.
See Flanagan (ibid.) for early descriptions of modeling, and Schroeter and Sondhi, "A hybrid time-frequency domain articulatory speech synthesizer", IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-35(7), 1987, and "Techniques for Estimating Vocal-Tract Shapes from the Speech Signal", ASSP 2(1), 1343, 1994. These papers reemphasize
that it is not possible to work backwards from the acoustic output to obtain
a unique mathematical description of the combined vocal fold-vocal tract system,
which is called the "inverse problem" herein. It is not possible to obtain information
that separately describes both the "zeros" in speech air flow caused by glottal
(i.e., vocal fold) closure and those caused by closed, or resonant structures
in the vocal tract. As a result, it is not possible to use the well developed
mathematics of modern signal acquisition, processing, coding, and reconstructing
to the extent needed.
In addition, given a mathematical vocal system model, it remains especially
difficult to associate it with a unique individual because it is very difficult
to obtain the detailed physiological vocal tract features of a given individual
such as tract lengths, diameters, cross sectional shapes, wall compliance, sinus
size, glottal size and compliance, lung air pressure, and other necessary parameters.
In some cases, deconvolving the excitation source from the acoustic output can
be done for certain sounds where the "zeros" are known to be absent, so the
major resonant structures such as tract lengths can be determined. For example,
simple acoustic resonator techniques (see the 1976 U.S. Pat. No. 4,087,632 by
Hafer) are used to derive the tongue body position by measuring the acoustic
formant frequencies (i.e., the vocal tube resonance frequencies) and to constrain
the tongue locations and tube lengths against an early, well known vocal tract
model by Coker, "A Model of Articulatory Dynamics and Control", Proc. of IEEE,
Vol. 64(4), 452-460, 1976. The problem with this approach is that only gross
dimensions of the tract are obtained, but detailed vocal tract features are
needed to unambiguously define the physiology of the human doing the speaking.
For more physiological details, x-ray imaging of the vocal tract has been used to obtain tube lengths, diameters, and resonator areas and structures. An optical laryngoscope, inserted into the throat, is also used to view the vocal fold open and close cycles and to observe their sizes and time behavior.
The limit to further performance improvements in acoustic speech recognition,
in speech synthesis, in speaker identification, and other related technologies
is directly related to our inability to accurately solve the inverse problem.
Present workers are unable to use acoustic speech output to work backwards to
accurately and easily determine the vocal tract transfer function, as well as
the excitation amplitude versus time. The "missing" information about the separation
of the excitation function from the vocal tract transfer function leads to many
difficulties in automating the coding of the speech for each speech time frame
and in forming speech sound-unit libraries for speech-related technologies.
A major reason for the problem is that workers have been unable to measure the excitation function in real time. This has made it difficult to automatically identify the start and stop of each voiced speech segment over which a speech sound unit is constant. It has also made it difficult to join (or to separate) the transitions between sequential vocalized speech units (e.g., syllables, phonemes, or multiplets of phonemes) as an individual human speaker articulates sounds at rates of approximately 10 phonemes per second, or two words per second.
The lack of precision in speech segment identification adds to the difficulty
in obtaining accurate model coefficients for both the excitation function and
the vocal tract. Further, this leads to inefficiencies in the algorithms and
the computational procedures required by the technological application such
as speech recognition. In addition, the difficulties described above prevent
the accurate coding of the unique acoustic properties of a given individual
for personalized, human speech synthesis or for pleasing vocoding. In addition,
the "missing" information prevents complete separation of the excitation from
the transfer function, and limits accurate speaker-independent speech-unit coding
(speaker normalization). The incomplete normalization limits the ability to
conduct accurate and rapid speech recognition and/or speaker identification
using statistical codebook lookup techniques, because the variability of each
speaker's articulation adds uncertainty in the matching process and requires
additional statistical processing. The missing information and the timing difficulties
also inhibit the accurate handling of co-articulation, incomplete articulation,
and similar events where words are run together in the sequences of acoustic
units comprising a speech segment.
In the 1970s, workers in the field of speech recognition showed that short "frames" (e.g., 10 ms intervals) of the time waveform of a speech signal could be well approximated by an all-pole (no zeros) analytic representation, using numerical "linear predictive coding" (LPC) coefficients found by solving covariance equations. Specific procedures are described in B. S. Atal and S. L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave", J. Acoust. Soc. Am. 50(2), pp. 63, 1971. The LPC coefficients are a form of speech coding and have the advantage of characterizing acoustic speech with a relatively small number of variables (typically 20 to 30 per frame as implemented in today's systems). They make possible statistical table lookup of large numbers of word representations using Hidden Markov techniques for speech recognition.
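As an illustration of this kind of frame-by-frame analysis, the following sketch computes all-pole LPC coefficients for one short frame using the autocorrelation method and a Levinson-Durbin recursion (a common alternative to the covariance equations cited above); the frame length, model order, and synthetic test signal are illustrative assumptions, not values from the patent.

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """All-pole (LPC) coefficients for one short speech frame, estimated with
    the autocorrelation method and the Levinson-Durbin recursion."""
    x = frame * np.hamming(len(frame))                 # taper the frame edges
    r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lags 0..N-1
    a = [1.0]                                          # predictor polynomial A(z)
    err = r[0]                                         # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                                 # reflection (PARCOR) coefficient
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= (1.0 - k * k)
    return np.array(a), err                            # a[1..order] are the LPC terms

# Illustrative use on a synthetic 10 ms frame sampled at 8 kHz (80 samples).
fs = 8000
t = np.arange(80) / fs
frame = (np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
         + 0.01 * np.random.randn(80))
coeffs, residual_energy = lpc_coefficients(frame, order=12)
```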
In speech synthesizers, code books of acoustic coefficients (e.g., using well
known LPC, PARCOR, or similar coefficients) for each of the phonemes and for
a sufficient number of diphonemes (i.e. phoneme pairs) are constructed. Upon
demand from text-to-speech generators, they are retrieved and concatenated to
generate synthetic speech. However, as an accurate coding technique, they only
approximate the speech frames they represent. Their formation and use is not
based upon using knowledge of the excitation function, and as a result they
do not accurately describe the condition of the articulators. They are also
inadequate for reproducing the characteristics of the given human speaker. They
do not permit natural concatenation into high quality natural speech. They can
not be easily related to an articulatory speech model to obtain speaker-specific
physiological parameters. Their lack of association with the articulatory configuration
makes it difficult to do speaker normalization, as well as to deal with the
coarticulation and incomplete articulation problem of natural speech.
Present Example of Speech Coding:
Rabiner, in "Applications of Voice Processing to Telecommunications" Proc. of
the IEEE 82, 199 February 1994 points out that several modern text-to-speech
synthesis systems in use today by AT&T use 2000 to 4000 diphonemes, which
are needed to simulate the phoneme-tophoneme transitions in the concatenation
process for natural speech sounds. FIG. 1 shows a prior art open loop acoustic
speech coding system in which acoustic signals from a microphone are processed,
e.g. by LPC, and feature vectors are produced and stored in a library. Rabiner
also points out (page 213) that in current synthesis models, the vocal source
excitation and the vocal tract interaction "is grossly inadequate", and also
that "when natural duration and pitch are copied onto a text-to-speech utterance,
. . . the quality of the . . . synthetic speech improves dramatically." Presently,
it is not possible to economically capture the natural pitch duration and voiced
air-pulse amplitude vs. time, as well as individual vocal tract qualities, of
a given individual's voice in any of the presently used models, except by very
expensive and invasive laboratory measurements and computations.
J. L. Flanagan, "Technologies for Multimedia Communications", Proc. IEEE 82,
590, April 1994, describes low bandwidth speech coding: "At fewer than 1 bit
per Nyquist sample, source coding is needed to additionally take into account
the properties of the signal generator (such as voiced/unvoiced distinctions
in speech, and pitch, intensity, and formant characteristics)." There is presently no commercially useful method to account for the speech excitation source in order to minimize the coding complexity and subsequent bandwidth.
EM Sensors and Acoustic Information:
The use of EM sensors for measuring speech organ conditions for the purposes of speech recognition and related technologies is described in copending U.S. patent application Ser. No. 08/597,596 by Holzrichter. Although it has been
recognized for many decades in the field of speech recognition that speech organ
position and motion information could be useful, and EM sensors (e.g. rf and
microwave radars) were available to do the measurement, no one had suggested
a system using such sensors to detect the motions and locations of speech organs.
Nor had anyone described how to use this information to code each speech unit
and to use the code in an algorithm to identify the speech unit, or for other
speech technology applications such as synthesis. Holzrichter showed how to
use EM sensor information with simultaneously obtained acoustic data to obtain
the positions of vocal organs, how to define feature vectors from this organ
information to use as a coding technique, and how to use this information to
do high-accuracy speech recognition. He also pointed out that this information
provided a natural method of defining changes in each phoneme by measuring changes
in the vocal organ conditions, and he described a method to automatically define
each speech time frame. He also showed that "photographic quality" EM wave images,
obtained by tomographic or similar techniques, were not necessary for the implementation
of the procedures he described, nor for the procedures described herein.
SUMMARY OF THE INVENTION
Accordingly it is an object of the invention to provide method and apparatus
for speech coding using nonacoustic information in combination with acoustic
information.
It is also an object of the invention to provide method and apparatus for speech
coding using Electromagnetic (EM) wave generation and detection modules in combination
with acoustic information.
It is also an object of the invention to provide method and apparatus for speech
coding using radar in combination with acoustic information.
It is another object of the invention to use micropower impulse radar in conjunction
with acoustic information for speech coding.
It is another object of the invention to use the methods and apparatus provided
for speech coding for the purposes of speech recognition, mathematical approximation,
information storage, speech compression, speech synthesis, vocoding, speaker
identification, prosthesis, language teaching, speech correction, language identification,
and other speech related applications.
The invention is a method and apparatus for joining nonacoustic and acoustic data. Nonacoustic information describing the speech organs is obtained using electromagnetic (EM) waves such as RF waves, microwaves, millimeter waves, infrared, or optical waves at wavelengths that reach the speech organs for measurement. This information is combined with conventional acoustic information measured with a microphone, using a deconvolving algorithm, to produce more accurate speech coding than is obtainable using acoustic information alone. The coded information, representing the speech, is then available for speech technology applications such as speech compression, speech recognition, speaker recognition, speech synthesis, and speech telephony (i.e., vocoding).
Simultaneously obtained EM sensor and acoustic information are used to define
a time frame and to obtain the details of a human speaker's excitation function
and vocal tract function for each speech time frame. The methods make available
the formation of numerical feature vectors for characterizing the acoustic speech
unit spoken during each speech time frame. This makes possible a new method of speech
characterization (i.e., coding) using a more complete and accurate set of information
than has been available to previous workers. Such coding can be used for purposes
of more accurate and more economical speech recognition, speech compression,
speech synthesis, vocoding, speaker identification, teaching, prosthesis, and
other applications.
The present invention enables the user to obtain the transfer function of the
human speech system for each speech time frame defined using the methods herein.
In addition, the present invention includes several algorithmic methods of coding
(i.e., numerically describing) these functions for valuable applications in
speech recognition, speech synthesis, speaker identification, speech transmission,
and many other applications. The coding system, described herein, can make use
of much of the apparatus and data collection techniques described in the copending
patent application Ser. No. 08/547,596, including EM wave generation, transmission,
and detection, as well as data averaging, and data storage algorithms. The procedures
defined in the copending patent application are called NASR or Non Acoustic
Speech Recognition. Procedures based upon acoustic prior art are called CASR
for Conventional Acoustic Speech Recognition, and these procedures are also
used herein to provide processed acoustic information.
The following terms are used herein. An acoustic speech unit is the single or
multiple sound utterance that is being described, recognized, or synthesized
using the methods herein. Examples include syllables, demi-syllables, phonemes,
phone-like speech units (i.e., PLUs), diphones, triphones, and more complex
sound sequences such as words. Phoneme acoustic-speech-units are used for most
of the speech unit examples herein. A speech frame is a time during which speech
organ conditions (including repetitive motions of the vocal folds) and the acoustic
output remain constant within pre-defined values that define the constancy.
Multiple time frames are a sequence of time frames joined together in order
to describe changes in acoustic or speech organ conditions as time progresses.
A speech period, or pitch period, is the time the glottis is open plus the time it is closed until the next glottal cycle begins, which includes transitions to unvoiced speech or to silence. A speech segment is a period of time of sounded
speech that is being processed using the methods herein. Glottal tissue includes
vocal fold tissue and surrounding tissue, and glottal open/close cycles are
the same as vocal fold open/close cycles. The word functional, as used herein, means a mathematical function with both variables and symbolic parameter-coefficients, whereas the word function means a functional with defined numerical parameter-coefficients.
The present methods and apparatus work for all human speech sounds and languages,
as well as for animal sounds generated by vocal organ motions detectable by
EM sensors and processed as described. The examples are based on, but not limited to, American English speech.
1) EM Sensor Generator:
All configurations of EM wave generation and detection modules that meet the
requirements for frequency, timing, pulse format, tissue transmission, and power
(and safety) can be used. EM wave generators may be used which, when related
to the distance from the antenna(s), operate in the EM near-field mode (mostly
non-radiating), in the intermediate-EM-field mode where the EM wave is both
non-radiating and radiating, and in the radiating far-field mode (i.e. most
radars). EM waves in several wavelength bands, from below 10^8 Hz to above 10^14 Hz, can penetrate tissue and be used as described herein. A particular example
is a wide-band microwave EM generator impulse radar, radiating 2.5 GHz signals
and repeating its measurement at a 2 MHz pulse repetition rate, which penetrates
over 10 cm into the head or neck. Such units have been used with appropriate
algorithms to validate the methods. These units have been shown to be economical
and safe for routine human use. The speech coding experiments have been conducted
using EM wave transmit/receive units (i.e., impulse radars) in two different
configurations. In one configuration, glottal open-close information, together
with simultaneous acoustic speech information, was obtained using one microphone
and one radar unit. In a second set of experiments, three EM sensor units and
one acoustic unit were used. In addition, a particular method is described for
improving the accuracy of transmitting and receiving an electromagnetic wave
into the head and neck, for very high accuracy excitation function descriptions.
2) EM Sensor Detector:
Many different EM wave detector modes have been demonstrated for the purpose
of obtaining nonacoustic speech organ information. A multiple pulse, fixed-range-gate
reception system (i.e., field disturbance mode) has been used for vocal fold
motion and nearby tissue motion detection. Other techniques have been used to
determine the positions of other vocal organs to obtain added information on
the condition of the vocal tract. Many other systems are described in the radar
literature on EM wave detection, and can be employed.
3) Configuration Structures and Control System:
Many different control techniques for portable and fixed EM sensor/acoustic
systems can be used for the purposes of speech coding. However, the processing
procedures described herein may require additional and different configurations
and control systems. For example, in applications such as high fidelity, "personalized"
speech synthesis, extra emphasis must be placed on the quality of the instrumentation,
the data collection, and the sound unit parsing. The recording environments,
the instrumentation linearity, the dynamic range, the relative timing of the
sensors (e.g. acoustic propagation time from the glottis to the microphone),
the A/D converter accuracy, the processing algorithms' speed and accuracy, and
the qualities of play back instrumentation are all very important.
4) Processing Units and Algorithms:
For each set of received EM signals and acoustic signals there is a need to
process and extract the information on organ positions (or motions) and to use
the coded speech sounds for the purposes of deconvolving the excitation from
the acoustic output, and for tract configuration identification. For example,
information on the positions of the vocal folds (and therefore the open area
for air flow) vs. time is obtained by measuring the reflected EM waves as a
function of time. Similarly, information on the conditions of the lips, jaw,
teeth, tongue, and velum positions can be obtained by transmitting EM waves
from other directions and using other pulse formats. The reflected and received
signals from the speech organs are stored in a memory and processed every speech
time frame, as defined below. The reflected EM signals can be digitized, averaged,
and normalized, as a function of time, and feature vectors can be formed.
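A minimal sketch of this reduction step is given below; it averages the repeated EM returns collected during one speech time frame, normalizes them, and appends a coarse acoustic spectral description to form a single feature vector. The array shapes, coefficient counts, and normalization are assumed for illustration and are not the specific procedures claimed.

```python
import numpy as np

def em_acoustic_feature_vector(em_pulses, acoustic_frame, n_em_coeffs=16, n_fft=64):
    """Form one feature vector for a speech time frame from EM sensor returns
    (one row per received pulse) and the simultaneous acoustic samples."""
    em_avg = em_pulses.mean(axis=0)                       # average the repeated EM returns
    em_avg = em_avg / (np.max(np.abs(em_avg)) + 1e-12)    # normalize the averaged return
    em_coeffs = np.interp(np.linspace(0, len(em_avg) - 1, n_em_coeffs),
                          np.arange(len(em_avg)), em_avg)  # resample to a fixed length

    spectrum = np.abs(np.fft.rfft(acoustic_frame, n_fft))  # coarse acoustic spectrum
    spectrum = spectrum / (np.max(spectrum) + 1e-12)

    return np.concatenate([em_coeffs, spectrum])          # joined EM/acoustic feature vector
```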
The present invention uses EM sensor data to automatically define a speech time
frame using the number of times that the glottis opens and closes for vocalized
speech, while the conditions of other speech organs and the acoustics remain
substantially constant. The actual speech time frame interval used for the processing
(for either coding or reconstructing) can be adapted to optimize the data processing.
The interval can be described by one or several constant single pitch periods,
by a single pitch period value and a multiplier describing the number of substantially
identical periods over which little sound change occurs, or it can use the pitch
periods to describe a time interval of essentially constant speech but with
"slowly changing" organ or acoustic conditions. The basic glottal-period timing-unit
serves as a master timing clock. The use of glottal periods for master timing
makes possible an automated speech and vocal organ information processing system
for coding spoken speech, for speech compression, for speaker identification,
for obtaining training data, for codebook or library generation, for synchronization
with other instruments, and for other applications. This method of speech frame
definition is especially useful for defining diphones and higher order multiple
sound acoustic speech units, for time compression and alignment, for speaker
speech rate normalization, and for prosody parameter definition and implementation.
Timing can also be defined for unvoiced speech, similarly to the procedures
used for vocalized speech.
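The sketch below illustrates this use of the glottal cycle as a master timing clock: glottal cycle boundaries are located in the EM sensor waveform, and consecutive pitch periods are grouped into one speech time frame for as long as a simple measure of the acoustics stays nearly constant. The zero-crossing detector and the energy-change tolerance are illustrative stand-ins for the detection and constancy criteria described above.

```python
import numpy as np

def glottal_cycle_boundaries(em_signal):
    """Locate glottal cycle starts in the EM sensor waveform as positive-going
    zero crossings (a simplified stand-in for open/close detection)."""
    s = em_signal - np.mean(em_signal)
    return np.where((s[:-1] < 0) & (s[1:] >= 0))[0]       # sample indices of cycle starts

def frames_from_pitch_periods(boundaries, acoustic, change_tol=0.25):
    """Group consecutive pitch periods into speech time frames while the
    per-period acoustic energy stays within change_tol of the frame's start."""
    frames, start, ref = [], 0, None
    for i in range(len(boundaries) - 1):
        seg = acoustic[boundaries[i]:boundaries[i + 1]]
        energy = float(np.mean(seg ** 2))
        if ref is None:
            ref = energy
        elif abs(energy - ref) > change_tol * ref:
            frames.append((boundaries[start], boundaries[i]))   # close the current frame
            start, ref = i, energy
    frames.append((boundaries[start], boundaries[-1]))
    return frames                                          # list of (start, end) sample pairs
```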
Once a speech time frame is defined, the user deconvolves the acoustic excitation
function from the acoustic output function. Both are simultaneously measured
over the defined time frame. Because the mathematical problems of "invertibility" are overcome, much more accurate and efficient coding occurs compared to previous
methods. By measuring the human excitation source function in real time, including
the time during which the vocal folds are closed and the airflow stops (i.e.,
the glottal "zeros"), accurate approximations of these very important functional
shapes can be employed to model each speech unit. As a result of this new capability
to measure the excitation function, the user can employ very accurate, efficient
digital signal processing techniques to deconvolve the excitation function from
the acoustic speech output function. For the first time, the user is able to
accurately and completely describe the human vocal tract transfer function for
each speech unit.
There are three speech functions that describe human speech: E(t)=excitation
function, H(t)=transfer function, and I(t)=output acoustics function. The user
can determine any one of these three speech functions by knowing the two other
functions. The human vocal system operates by generating an excitation function,
E(t), which produces rapidly pulsating air flow (or air pressure pulses) vs.
time. These (acoustic) pulses are convolved with (or filtered by) the vocal
tract transfer function, H(t), to obtain a sound output, I(t). Being able to
measure, conveniently in real time, the input excitation E and the output I,
makes it possible to use linear mathematical processing techniques to deconvolve
E from I. This procedure allows the user to obtain an accurate numerical description
of the speaker's transfer function H. This method conveniently leads to a numerical
Fourier transform of the function H, which is represented as a complex amplitude
vs. frequency. A time domain function is also obtainable. These numerical functions
for H can be associated with model functions, or can be stored in tabular form,
in several ways. The function H is especially useful because it describes, in
detail, each speaker's vocal tract acoustical system and it plays a dominant
role in defining the individualized speech sounds being spoken.
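A minimal sketch of this deconvolution step follows, assuming a discrete frequency-domain division with a small regularization term to keep the division stable where the excitation spectrum is weak (the regularization is an added assumption, not part of the disclosure).

```python
import numpy as np

def vocal_tract_transfer_function(excitation, acoustic_output, n_fft=1024, eps=1e-3):
    """Estimate H for one voiced speech time frame by deconvolving the measured
    excitation E(t) from the measured acoustic output I(t): H(f) ~ I(f)/E(f)."""
    E = np.fft.rfft(excitation, n_fft)
    I = np.fft.rfft(acoustic_output, n_fft)
    H = I * np.conj(E) / (np.abs(E) ** 2 + eps)   # regularized spectral division
    h_t = np.fft.irfft(H, n_fft)                  # time-domain form of the transfer function
    return H, h_t                                 # complex amplitude vs. frequency, and h(t)
```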
Secondly, a synthesized output acoustic function, I(t), can be produced by convolving
the voiced excitation function, E(t), with the transfer function, H(t), for
each desired acoustic speech unit. Thirdly, the excitation function, E, can
be determined by deconvolving a previously obtained transfer function, H, from
a measured acoustic output function, I. This third method is useful to obtain
the modified-white-noise excitation-source spectra to define an excitation function
for each type of unvoiced excitation. In addition, these methods can make use
of partial knowledge of the functional forms E, H, or I for purposes of increasing
the accuracy or speed of operation of the processing steps. For example, the
transfer function H is known to contain a term R which describes the lips-to-listener
free space acoustic radiation transfer function. This function R can be removed
from H leaving a simpler function, H*, which is easier to normalize. Similar
knowledge, based on known acoustic physics, and known physiological and mechanical
properties of the vocal organs, can be used to constrain or assist in the coding
and in specific applications.
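For the second use named above, a correspondingly simple sketch resynthesizes the acoustic output of a frame by convolving an excitation with the stored time-domain transfer function; trimming the result to the frame length is an illustrative choice.

```python
import numpy as np

def synthesize_frame(excitation, h_t):
    """Resynthesize the acoustic output I(t) for one speech time frame by
    convolving the excitation E(t) with the time-domain transfer function h(t)."""
    return np.convolve(excitation, h_t)[:len(excitation)]
```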
The Bases of the Methods:
1) The vocalized excitation function of a speaker and the acoustic output from
the speaker are accurately and simultaneously measured using an EM sensor and
a microphone. As one important consequence, the natural opening and closing
of a speaker's glottis can serve as a master timing clock for the definition
of speech time frames.
2) The data from 1) is used to deconvolve the excitation function from the acoustic
output and to obtain the speaker's vocal tract transfer function for each speech time frame.
3) Once the excitation function, the transfer function, and the acoustic function
parameters are determined, the user forms feature vectors that characterize
the speech in each time frame of interest to the degree desired.
4) The formation procedures for the feature vectors are valuable and make possible
new procedures for more accurate, efficient, and economical speech coding, speech
compression, speech recognition, speech synthesis, telephony, speaker identification,
and other related applications.
Models and Coding of Human Speech:
It is common practice in acoustic speech technology as well as in many linear
system applications to use mathematical models of the system. Such models are
used because it is inefficient to retain all of the information measured in
a time-evolving (e.g., acoustic) signal, and because they provide a defining
constraint (e.g., a pattern or functional form) for simplifying or imposing
physical knowledge on the measured data. Users want to employ methods to retain
just enough information to meet the needs of their application and to be compatible
with the limitations of their processing electronics and software. Models fall into two general categories: linear and non-linear. The methods herein describe a large number of linear models for processing both the EM sensor and the acoustic information for purposes of speech coding, models that have not been available to previous practitioners of speech technology. The methods also include coding using nonlinear models of speech that are quantifiable by table lookup, by curve fitting, by perturbation methods, or by more sophisticated techniques relating an output signal to an input signal; these, too, have not been available to users.
The simultaneously obtained acoustic information can also be processed using
well known standard acoustic processing techniques. Procedures for forming feature
vectors using the processed acoustic information are well known. The resulting
feature vector coefficients can be joined with the feature vector coefficients generated by the EM sensor/acoustic methods described herein.
Vocal system models are generally described by an excitation source which drives
an acoustic resonator tract, from whence the sound pressure wave radiates to
a listener or to a microphone. There are two major types of speech: 1) voiced
where the vocal folds open and close rapidly, at approximately 70 to 200 Hz,
providing periodic bursts of air into the vocal tract, and 2) "unvoiced" excitations
where constrictions in the vocal tract cause air turbulence and associated modified-white
acoustic-noise. (A few sounds are made by both processes at the same time).
The human vocal tract is a complex acoustic-mechanical filter that transforms
the excitation (i.e., noise source or air pressure pulses) into recognizable
sounds, through mostly linear processes. Physically the human acoustic tract
is a series of tubes of different lengths, different area shapes, with side
branch resonator structures, nasal passage connections, and both mid and end
point constrictions. As the excitation pressure wave proceeds from the excitation
source to the mouth (and/or nose), it is constantly being transmitted and reflected
by changes in the tract structure, and the output wave that reaches the lips
(and nose) is strongly modified by the filtering processes. In addition, the
pressure pulses cause the surrounding tissue to vibrate at low levels which
affects the sound as well. It is also known that a backward propagating wave (i.e., a wave reflected off vocal tract transitions) travels backward toward the vocal folds and the lungs. It is not heard acoustically, but it can influence the glottal system and it does cause vocal tract tissue to vibrate. Such vibrations can be measured by an EM sensor used in a microphone mode.
Researchers at Bell Laboratories (Flanagan, Olive, Sondhi and Schroeter ibid.)
and elsewhere have shown that accurate knowledge of the excitation source characteristics
and the associated vocal tract configurations can uniquely characterize a given
acoustic speech unit such as a syllable, phoneme, or more complex unit. This
knowledge can be conveyed by a relatively small set of numbers, which serve
as the coefficients of feature vectors that describe the speech unit over each
speech time frame. They can be generated to meet the degree of accuracy demanded
by the applications. It is also known that if a change in a speech sound occurs,
the speaker has moved one or more speech organs to produce the changed sound.
The methods described herein can be used to detect such changes, to define a
new speech time frame, and to form a new feature vector to describe the new
speech conditions.
The methods for obtaining accurate vocal tract transfer function information can be used to define the coefficients of the feature vector that describes the totality of speech tract information for each time frame.
One type of linear model often used to describe the vocal tract transfer function
is an acoustic-tube model (see Sondhi and Schroeter, ibid). A user divides up
the human vocal tract into a large number of tract segments (e.g., 20) and then,
using advanced numerical techniques, the user propagates (numerically) sound
waves from an excitation source to the last tract segment (i.e., the output)
and obtains an output sound. The computer keeps track of all the reflections,
re-reflections, transmissions, resonances, and other propagation features. Experts
find the sound to be acceptable, once all of the parameters defining all the
segments plus all the excitation parameters are obtained.
While this acoustic tube model has been known for many years, the parameters
describing it have been difficult to measure, and essentially impossible to
obtain in real time from a given speaker. The methods herein, which describe the measuring of the excitation function and the acoustic output and the deconvolving procedures, yield a sufficient number of the needed parameters that the constrictions and conditions of the physical vocal tract structure model can be described for each time frame. One-dimensional numerical procedures, based upon time-series techniques,
have been experimentally demonstrated on systems with up to 20 tract segments
to produce accurate models for coding and synthesis.
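A much-simplified sketch of such a concatenated-tube (scattering-junction) model follows; it uses reflection coefficients derived from adjacent segment areas, one sample of delay per section, and idealized glottal and lip boundary conditions (no losses, no end reflections). The section areas and the pulse-train excitation are illustrative placeholders, not measured vocal tract data.

```python
import numpy as np

def junction_reflection_coefficients(areas):
    """Reflection coefficients at the junctions of successive tube sections,
    for pressure-like wave variables (acoustic impedance ~ 1/area)."""
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def tube_model_output(excitation, areas):
    """Propagate an excitation through lossless tube sections, tracking the
    forward and backward waves and their scattering at each junction."""
    k = junction_reflection_coefficients(areas)
    n_sec = len(areas)
    fwd = np.zeros(n_sec)                     # forward wave in each section
    bwd = np.zeros(n_sec)                     # backward wave in each section
    out = np.zeros(len(excitation))
    for n, x in enumerate(excitation):
        new_fwd, new_bwd = np.zeros(n_sec), np.zeros(n_sec)
        new_fwd[0] = x                        # glottal end: inject excitation, ignore reflection
        for j in range(n_sec - 1):            # scattering at each junction
            new_fwd[j + 1] = (1 + k[j]) * fwd[j] - k[j] * bwd[j + 1]
            new_bwd[j] = k[j] * fwd[j] + (1 - k[j]) * bwd[j + 1]
        out[n] = fwd[n_sec - 1]               # lip end: radiate, ignore lip reflection
        fwd, bwd = new_fwd, new_bwd
    return out

# Illustrative use with placeholder section areas (cm^2) and a crude glottal pulse train.
areas = [2.6, 2.6, 1.8, 1.0, 1.3, 2.0, 2.6, 3.0]
excitation = np.zeros(800)
excitation[::100] = 1.0
sound = tube_model_output(excitation, areas)
```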
A second type of linear acoustic model for the vocal tract is based upon electrical
circuit analogies where excitation sources and transfer functions (with poles
and zeros) are commonly used. The corresponding circuit values can be obtained
using measured excitation function, output function, and derived transfer-function
values. Such circuit analog models range from single mesh circuit analogies,
to 20 (or more) mesh circuit models. By defining the model with current representing
volume-air-flow (and voltage representing air pressure), then using capacitors
to represent acoustic tract-section chamber-volumes, inductors to represent
acoustic tract-section air-masses, and resistors to represent acoustic tract-section
air-friction and heat loss values, the user is able to model a vocal tract using
electrical system techniques. Circuit structures (such as T's and/or Pi's) correspond
to the separate structures of the acoustic system, such as tube lengths, tongue
positions, and side resonators of a particular individual. In principle, the
user chooses the circuit constants and structures to meet the complexity requirements
and forms a functional, with unknown parameter values. In practice it has been
easy to define circuit analogs, but very difficult to obtain the values describing
a given individual and even more difficult to measure them in real time. Using
a one mesh model, an electrical analog method has been experimentally validated
for obtaining the information needed to determine the feature vector coefficients
of a human in real time.
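A sketch of the simplest such analog, a single series mesh, is shown below: voltage stands for air pressure, current for volume air flow, and R, L, and C stand for air friction and heat loss, air mass, and chamber volume. The component values are placeholders chosen only to place the mesh's resonance in the speech band; they are not values measured from a speaker.

```python
import numpy as np

def single_mesh_admittance(freqs_hz, R=1.0, L=0.01, C=1e-6):
    """Frequency response (volume flow per unit driving pressure) of a single
    series R-L-C mesh used as an electrical analog of one vocal tract section."""
    w = 2 * np.pi * np.asarray(freqs_hz, dtype=float)
    z = R + 1j * w * L + 1.0 / (1j * w * C)        # series-mesh impedance
    return 1.0 / z                                  # admittance = "air flow" / "pressure"

freqs = np.linspace(50.0, 4000.0, 500)
Y = single_mesh_admittance(freqs)
resonance_hz = freqs[np.argmax(np.abs(Y))]          # peak marks the formant-like resonance
```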
A third important model is based upon time-series procedures (a type of digital signal processing) using autoregressive moving-average (ARMA) techniques. This
approach is especially valuable because it characterizes the behavior of a wave
as it traverses a series of transitions in the propagating media. The degree
of the ARMA functional reflects the number of transitions (i.e., constrictions
and other changes) in acoustic tracts used in the model of the individual. Such
a model is also very valuable because it allows the incorporation of several
types of excitation sources, the reaction of the propagating waves on the vocal
tract tissue media itself, and the feedback by backward propagating wave to
the excitation functions. The use of ARMA models has been validated using 14
zeros and 10 poles to form the feature vector for the vocal tract transfer function
of a speaker saying the phoneme /ah/ as well as other sounds.
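The sketch below fits such an ARMA (pole-zero) relation between a measured excitation and the acoustic output of one frame by ordinary least squares on the difference equation; the 10-pole, 14-zero orders follow the example in the text, while the plain least-squares estimator is an illustrative choice.

```python
import numpy as np

def fit_arma(excitation, output, n_poles=10, n_zeros=14):
    """Least-squares fit of I[n] = -sum_k a_k I[n-k] + sum_m b_m E[n-m],
    relating the acoustic output I to the measured excitation E."""
    E = np.asarray(excitation, dtype=float)
    I = np.asarray(output, dtype=float)
    start = max(n_poles, n_zeros)
    rows, targets = [], []
    for n in range(start, len(I)):
        past_out = [-I[n - j] for j in range(1, n_poles + 1)]   # autoregressive terms
        past_in = [E[n - m] for m in range(0, n_zeros + 1)]     # moving-average terms
        rows.append(past_out + past_in)
        targets.append(I[n])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    a = theta[:n_poles]              # denominator (pole) coefficients a_1..a_10
    b = theta[n_poles:]              # numerator (zero) coefficients b_0..b_14
    return a, b                      # these become feature-vector entries for the frame
```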
A fourth method is to use generalized curve fitting procedures to fit data in
tables of the measured excitation-function and acoustic-output processed values.
The process of curve fitting (e.g., using polynomials, LPC procedures, or other
numerical approximations) is to use functional forms that are computationally
well known and that use a limited number of parameters to produce an acceptable
fit to the processed numerical data. Sometimes the functional forms include
partial physical knowledge. These procedures can be used to measure and quantify
arbitrary linear as well as non-linear properties relating the output to the
input.
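As one minimal example of this fourth approach, a low-order polynomial can be fit to tabulated transfer-function magnitudes; the polynomial form and degree are illustrative choices, and other well-behaved functional forms could be substituted.

```python
import numpy as np

def fit_transfer_magnitude(freqs_hz, h_magnitude, degree=8):
    """Fit a polynomial in frequency to tabulated |H(f)| values; the fitted
    coefficients form a compact, approximate code for the measured data."""
    coeffs = np.polyfit(freqs_hz, h_magnitude, degree)
    fitted = np.polyval(coeffs, freqs_hz)
    return coeffs, fitted
```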
5) Speech Coding System and Post Processing Units:
The following devices can be used as part of a speech coding system or all together
for a variety of user chosen speech related applications. All of the following
devices, except generic peripherals, are specifically designed to make use of
the present methods and will not operate at full capability without these methods.
a) Telephone receiver/transmitter unit with EM sensors: A unit, chosen for the
application, contains the needed EM sensors, microphone, speaker, and controls
for the application at hand. The internal components of such a telephone-like
unit can include one or more EM sensors, a processing unit, a control unit,
a synthesis unit, and a wireless transmission unit. This unit can be connected
to a more complex system using wireless or transmission line techniques.
b) Control Unit: A specific device that carries out the control intentions of the user by directing the specific processors to work in a defined way. It directs the information to the specified processors, stores the processed data as directed in short- or long-term memory, and can transmit the data to another specified device for special processing, to display units, or to communications devices as directed.
c) Speech Coding Unit: A specific type of coding processor that joins information from an acoustic sensor with vocal organ information from the EM sensor system (e.g., from vocal fold motions) to generate a series of coefficients that are formed into a feature vector for each speech time frame. The algorithms to accomplish these actions are contained therein.
d) Speech Recognizer: Post processing units are used to identify the feature
vectors formed by the speech coding unit for speech recognition applications.
The speech recognition unit matches the feature vector from c) with those in
a pre-constructed library. The other post-processing units associated with recognition
(e.g., spell checkers, grammar checkers, and syntax checkers) are commonly needed
for the speech coding applications.
e) Speech Synthesizer and Speaker: Coded speech can be synthesized into audio
acoustic output. Information, thus coded, can be retrieved from the user's recent
speech, from symbolic information (e.g., ASCII symbol codes) that is converted
into acoustic output, from information transmitted from other systems, and from
system communications with users. Furthermore, the coded speech can be altered
and synthesized into many voices or languages.
f) Speaker Identification: As part of the post processing, the idiosyncratic
speech and organ motion characteristics of each speaker can be analyzed and
compared in real time. The comparison is to known records of the speaker's physical
speech organ motions, shapes, and language usage properties for a sequence of
words. The EM sensor information adds a new dimension of sophistication in the
identification process that is not possible using acoustic speech alone.
g) Encryption Units: Speech coded by the procedures herein can be further coded (i.e., encrypted) in various ways to make it difficult to use by anyone other than an authorized user. The methods described herein allow the user to code speech with such a low bandwidth requirement that encryption information can be added to the transmitted speech signal without requiring additional bandwidth beyond what is normally used.
h) Display Units: Computer-rendered speech information must be made available to the user for a variety of applications. A video terminal is used to show the written-word rendition of the spoken words and graphical renditions of the information (e.g., the articulators in a vocal tract), and a speaker is used to play previously recorded and coded speech to the user. The information can also be printed using printers or fax machines.
i) Hand Control Units: Hand control units can assist in the instruction of the
system being spoken to. The advantage of a hand control unit (similar to a "mouse")
is that it can assist in communicating or correcting the type of speech being
inputted. Examples are to distinguish control instructions from data inputting, to assist in editing by directing a combined speech- and hand-directed cursor to increase the speed of identifying displayed text segments, to increase the certainty of control by the user, to elicit playback of desired synthesized phrases, to request vocal tract pictures of the speaker's articulator positions for language correction, etc.
j) Language Recognizer and Translator Unit: As the speaker begins to talk into
a microphone, this device codes the speech and characterizes the measured series
of phonemes as to the language to which they belong. The system can request
the user to pronounce known words which are identified, or the system can use
statistics of frequent word sound patterns to conduct a statistical search through
the codebooks for each language.
It is also convenient to use this same unit, and the procedures described herein,
to accept speech recognized words from one language and to translate the symbols
for the same words into the speech synthesis codes for the second language.
The user may implement control commands requesting the speaker to identify the languages to be used. Alternatively, the automatic language identification unit can use the statistics of the language to identify the languages from which and to which the translations are to take place. The translator then performs
the translation to the second desired language, by using the speech unit codes,
and associated speech unit symbols, that the system generates while the first
language is spoken. The speech codes, generated by the translator, are then
converted into symbols or into synthesized speech in the desired second language.
k) Peripheral Units: Many peripheral units can be attached to the system as
needed by the user, making possible new capabilities. As an example, an auxiliary
instrument interface unit allows the connection of instruments, such as a video
camera, that require synchronization with the acoustic speech and speech coding.
A communications link is very useful because it provides wireless or transmission
line interfacing and communication with other systems. A keyboard is used to
interface with the system in a conventional way, but also to direct speech technology
procedures. Storage units such as disks, tape drives, and semiconductor memories are used to hold processed results or, during processing, for temporary storage of needed information.
The invention also includes telephone coding using identification procedures where the speech recognition results in a word identification. The word character computer code (e.g., ASCII) is transmitted along with no or minimal speaker voice characterization information for the purpose of minimizing the bandwidth of transmission. Word (i.e., language symbols such as letters, pictograms, and other symbols) transmission is known to be about 100-fold less demanding of transmission bandwidth than present speech telephony; thus the value of this transmission is very high.
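A rough estimate behind that figure is sketched below; the word rate, word length, and coded-speech rate are assumptions chosen only for illustration.

```python
# Rough bandwidth comparison; all rates below are illustrative assumptions.
words_per_second = 2            # speaking rate of roughly two words per second, as noted earlier
characters_per_word = 6         # assumed average word length plus a delimiter
bits_per_character = 8          # e.g., one ASCII byte
symbol_rate_bps = words_per_second * characters_per_word * bits_per_character   # ~96 bps
coded_speech_rate_bps = 8000    # an assumed low-rate coded-speech channel
ratio = coded_speech_rate_bps / symbol_rate_bps                                  # on the order of 100
```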
The methods include communication feedback to a user for many applications because
the physiological as well as acoustic information is accurately coded and available
for display or feedback. For speech correction or for foreign language learning,
displays of the vocal organs show organ mispositioning by the speaker. For deaf speakers, misarticulated sounds are identified and fed back using visual, tactile, or electrical stimulus units.
Changes and modifications in the specifically
described embodiments can be carried out without departing from the scope of
the invention which is intended to be limited only by the scope of the appended
claims.