Mel-frequency cepstrum
In sound processing, the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC.[1] They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for a better representation of sound, for example, in audio compression. MFCCs are commonly derived as follows:[2]

1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
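The steps above can be sketched in code. This is an illustrative, minimal implementation with numpy only; the function and parameter names are my own, the O'Shaughnessy mel formula is one common choice among several, and a production system would typically use an established feature-extraction library instead.

```python
import numpy as np

def hz_to_mel(f):
    # A common mel-scale formula (O'Shaughnessy); other variants exist.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate, n_filters=26, n_coeffs=13):
    """Compute MFCCs for one already-windowed frame (illustrative, not a library API)."""
    # Step 1: power spectrum of the windowed frame.
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Step 2: triangular filters whose centres are equally spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = np.clip((freqs - lo) / (ctr - lo), 0.0, None)
        falling = np.clip((hi - freqs) / (hi - ctr), 0.0, None)
        tri = np.minimum(rising, falling)          # triangle, zero outside [lo, hi]
        energies[i] = np.dot(tri, spectrum)

    # Step 3: log of the filterbank energies (small floor avoids log(0)).
    log_e = np.log(np.maximum(energies, 1e-10))

    # Step 4: DCT-II of the log energies, computed directly from its definition.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))

    # Step 5: the MFCCs are the first n_coeffs amplitudes of that transform.
    return dct @ log_e
```

For example, a 25 ms Hamming-windowed frame of a 440 Hz tone at 16 kHz would be processed as `mfcc(np.sin(2 * np.pi * 440 * np.arange(400) / 16000) * np.hamming(400), 16000)`, yielding a 13-element coefficient vector.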
There can be variations on this process, for example: differences in the shape or spacing of the windows used to map the scale,[3] or the addition of dynamics features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients.[4] In the early 2000s, the European Telecommunications Standards Institute defined a standardised MFCC algorithm to be used in mobile phones.[5]

Applications

MFCCs are commonly used as features in speech recognition systems,[6] such as systems that can automatically recognize numbers spoken into a telephone. MFCCs are also increasingly finding use in music information retrieval applications such as genre classification and audio similarity measures.[7]

Noise sensitivity

MFCC values are not very robust in the presence of additive noise, so it is common to normalise their values in speech recognition systems to lessen the influence of noise. Some researchers propose modifications to the basic MFCC algorithm to improve robustness, such as raising the log-mel amplitudes to a suitable power (around 2 or 3) before taking the discrete cosine transform (DCT), which reduces the influence of low-energy components.[8]

History

Paul Mermelstein[9][10] is typically credited with the development of the MFC. Mermelstein credits Bridle and Brown[11] for the idea.
Sometimes both early originators are cited.[12] Many authors, including Davis and Mermelstein,[10] have commented that the spectral basis functions of the cosine transform in the MFC are very similar to the principal components of the log spectra, which were applied to speech representation and recognition much earlier by Pols and his colleagues.[13][14]
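The "delta" dynamics features mentioned above under variations are commonly computed as a regression over neighbouring frames; "delta-delta" is the same operation applied to the deltas. A minimal sketch follows, assuming the widely used regression formula with N frames of context on each side (N=2 is a common but not universal choice, and edge frames are handled here by repeating the boundary values):

```python
import numpy as np

def delta(coeffs, N=2):
    """First-order (delta) features for an MFCC matrix of shape (frames, n_mfcc).

    delta_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    """
    F = len(coeffs)
    # Repeat the first/last frame so the regression is defined at the edges.
    padded = np.pad(coeffs, ((N, N), (0, 0)), mode='edge')
    num = sum(n * (padded[N + n:F + N + n] - padded[N - n:F + N - n])
              for n in range(1, N + 1))
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return num / denom
```

Applied twice, `delta(delta(mfccs))` gives the "delta-delta" (acceleration) coefficients; the static, delta, and delta-delta vectors are then typically concatenated frame by frame.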
References

1. Min Xu et al. (2004), "HMM-based audio keyword generation," in Kiyoharu Aizawa, Yuichi Nakamura, and Shin'ichi Satoh (eds.), Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia, Springer. ISBN 978-3-540-23985-7.
2. Md. Sahidullah and Goutam Saha (2012), "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition," Speech Communication, 54(4): 543–565. doi:10.1016/j.specom.2011.11.004.
3. Fang Zheng, Guoliang Zhang, and Zhanjiang Song (2001), "Comparison of Different Implementations of MFCC," J. Computer Science & Technology, 16(6): 582–589.
4. S. Furui (1986), "Speaker-independent isolated word recognition based on emphasized spectral dynamics."
5. European Telecommunications Standards Institute (2003), Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms. Technical standard ES 201 108, v1.1.3.
6. T. Ganchev, N. Fakotakis, and G. Kokkinakis (2005), "Comparative evaluation of various MFCC implementations on the speaker verification task," in 10th International Conference on Speech and Computer (SPECOM 2005), Vol. 1, pp. 191–194.
7. Meinard Müller (2007), Information Retrieval for Music and Motion, Springer, p. 65. ISBN 978-3-540-74047-6.
8. V. Tyagi and C. Wellekens (2005), "On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition," in Proc. ICASSP '05, Vol. 1, pp. 529–532. doi:10.1109/ICASSP.2005.1415167.
9. P. Mermelstein (1976), "Distance measures for speech recognition, psychological and instrumental," in C. H. Chen (ed.), Pattern Recognition and Artificial Intelligence, pp. 374–388. Academic, New York.
10. S. B. Davis and P. Mermelstein (1980), "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4): 357–366.
11. J. S. Bridle and M. D. Brown (1974), "An Experimental Automatic Word-Recognition System," JSRU Report No. 1003, Joint Speech Research Unit, Ruislip, England.
12. Nelson Morgan, Hervé Bourlard, and Hynek Hermansky (2004), "Automatic Speech Recognition: An Auditory Perspective," in Steven Greenberg and William A. Ainsworth (eds.), Speech Processing in the Auditory System, Springer, p. 315. ISBN 978-0-387-00590-4.
13. L. C. W. Pols (1966), "Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words," doctoral dissertation, Free University, Amsterdam, The Netherlands.
14. R. Plomp, L. C. W. Pols, and J. P. van de Geer (1967), "Dimensional analysis of vowel spectra," J. Acoustical Society of America, 41(3): 707–712.