Persistent Link:
http://hdl.handle.net/10150/223348
Title:
Machine Learning Methods for Articulatory Data
Author:
Berry, Jeffrey James
Issue Date:
2012
Publisher:
The University of Arizona.
Rights:
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract:
Humans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make this data useful to speech scientists, and for speech technology applications. This dissertation presents feature extraction methods for ultrasound images of the tongue and for data collected with an Electro-Magnetic Articulograph (EMA). The usefulness of these features is tested in several phoneme classification tasks. Feature extraction methods for ultrasound tongue images presented here consist of automatically tracing the tongue surface contour using a modified Deep Belief Network (DBN) (Hinton et al. 2006), and methods inspired by research in face recognition which use the entire image. The tongue tracing method consists of training a DBN as an autoencoder on concatenated images and traces, and then retraining the first two layers to accept only the image at runtime. This 'translational' DBN (tDBN) method is shown to produce traces comparable to those made by human experts. An iterative bootstrapping procedure is presented for using the tDBN to assist a human expert in labeling a new data set. Tongue contour traces are compared with the Eigentongues method of (Hueber et al. 2007), and a Gabor Jet representation in a 6-class phoneme classification task using Support Vector Classifiers (SVC), with Gabor Jets performing the best. These SVC methods are compared to a tDBN classifier, which extracts features from raw images and classifies them with accuracy only slightly lower than the Gabor Jet SVC method. For EMA data, supervised binary SVC feature detectors are trained for each feature in three versions of Distinctive Feature Theory (DFT): Preliminaries (Jakobson et al. 1954), The Sound Pattern of English (Chomsky and Halle 1968), and Unified Feature Theory (Clements and Hume 1995). Each of these feature sets, together with a fourth unsupervised feature set learned using Independent Components Analysis (ICA), are compared on their usefulness in a 46-class phoneme recognition task. Phoneme recognition is performed using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001), which takes advantage of the temporal nature of speech, by looking at observations adjacent in time. Results of the phoneme recognition task show that Unified Feature Theory performs slightly better than the other versions of DFT. Surprisingly, ICA actually performs worse than running the CRF on raw EMA data.
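The abstract notes that a linear-chain CRF exploits the temporal nature of speech by looking at observations adjacent in time. The following is a minimal sketch of that idea at decoding time only, not the dissertation's implementation: it runs Viterbi decoding over per-frame label scores plus label-to-label transition scores. The function name `viterbi`, the two toy labels, and all score values are hypothetical, and training the CRF weights is not shown.

```python
from typing import Dict, List, Sequence


def viterbi(
    emission: Sequence[Dict[str, float]],
    transition: Dict[str, Dict[str, float]],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emission[t][y]        -- score of label y at frame t (hypothetical values)
    transition[prev][y]   -- score of moving from label prev to label y
    """
    if not emission:
        return []
    # best[t][y] holds the best score of any path that ends in label y at frame t.
    best = [{y: emission[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emission)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in labels:
            # Pick the predecessor label that maximizes the path score into y.
            prev = max(labels, key=lambda p: best[-1][p] + transition[p][y])
            scores[y] = best[-1][prev] + transition[prev][y] + emission[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label to recover the full sequence.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example (assumed scores): the middle frame weakly prefers "t",
# but staying on the same label is rewarded and switching is penalized.
labels = ["a", "t"]
emission = [
    {"a": 2.0, "t": 0.0},
    {"a": 0.4, "t": 0.6},
    {"a": 2.0, "t": 0.0},
]
transition = {
    "a": {"a": 1.0, "t": -1.0},
    "t": {"a": -1.0, "t": 1.0},
}
decoded = viterbi(emission, transition, labels)
print(decoded)
```

Classifying each frame on its own would label the middle frame "t"; with the transition scores included, the decoded sequence stays on "a" throughout, which illustrates how adjacent observations can influence each frame's label.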
Type:
text; Electronic Dissertation
Keywords:
Conditional Random Fields; Deep Belief Networks; Machine Learning; Ultrasound Imaging; Linguistics; Articulatory Speech Data; Automatic Speech Recognition
Degree Name:
Ph.D.
Degree Level:
doctoral
Degree Program:
Graduate College; Linguistics
Degree Grantor:
University of Arizona
Advisor:
Archangeli, Diana B.; Fasel, Ian R.

Full metadata record

DC Field | Value | Language
dc.language.iso | en | en_US
dc.title | Machine Learning Methods for Articulatory Data | en_US
dc.creator | Berry, Jeffrey James | en_US
dc.contributor.author | Berry, Jeffrey James | en_US
dc.date.issued | 2012 | -
dc.publisher | The University of Arizona. | en_US
dc.rights | Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. | en_US
dc.description.abstract | Humans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make this data useful to speech scientists, and for speech technology applications. This dissertation presents feature extraction methods for ultrasound images of the tongue and for data collected with an Electro-Magnetic Articulograph (EMA). The usefulness of these features is tested in several phoneme classification tasks. Feature extraction methods for ultrasound tongue images presented here consist of automatically tracing the tongue surface contour using a modified Deep Belief Network (DBN) (Hinton et al. 2006), and methods inspired by research in face recognition which use the entire image. The tongue tracing method consists of training a DBN as an autoencoder on concatenated images and traces, and then retraining the first two layers to accept only the image at runtime. This 'translational' DBN (tDBN) method is shown to produce traces comparable to those made by human experts. An iterative bootstrapping procedure is presented for using the tDBN to assist a human expert in labeling a new data set. Tongue contour traces are compared with the Eigentongues method of (Hueber et al. 2007), and a Gabor Jet representation in a 6-class phoneme classification task using Support Vector Classifiers (SVC), with Gabor Jets performing the best. These SVC methods are compared to a tDBN classifier, which extracts features from raw images and classifies them with accuracy only slightly lower than the Gabor Jet SVC method. For EMA data, supervised binary SVC feature detectors are trained for each feature in three versions of Distinctive Feature Theory (DFT): Preliminaries (Jakobson et al. 1954), The Sound Pattern of English (Chomsky and Halle 1968), and Unified Feature Theory (Clements and Hume 1995). Each of these feature sets, together with a fourth unsupervised feature set learned using Independent Components Analysis (ICA), are compared on their usefulness in a 46-class phoneme recognition task. Phoneme recognition is performed using a linear-chain Conditional Random Field (CRF) (Lafferty et al. 2001), which takes advantage of the temporal nature of speech, by looking at observations adjacent in time. Results of the phoneme recognition task show that Unified Feature Theory performs slightly better than the other versions of DFT. Surprisingly, ICA actually performs worse than running the CRF on raw EMA data. | en_US
dc.type | text | en_US
dc.type | Electronic Dissertation | en_US
dc.subject | Conditional Random Fields | en_US
dc.subject | Deep Belief Networks | en_US
dc.subject | Machine Learning | en_US
dc.subject | Ultrasound Imaging | en_US
dc.subject | Linguistics | en_US
dc.subject | Articulatory Speech Data | en_US
dc.subject | Automatic Speech Recognition | en_US
thesis.degree.name | Ph.D. | en_US
thesis.degree.level | doctoral | en_US
thesis.degree.discipline | Graduate College | en_US
thesis.degree.discipline | Linguistics | en_US
thesis.degree.grantor | University of Arizona | en_US
dc.contributor.advisor | Archangeli, Diana B. | en_US
dc.contributor.advisor | Fasel, Ian R. | en_US
dc.contributor.committeemember | Bever, Thomas G. | en_US
dc.contributor.committeemember | Morrison, Clayton T. | en_US
dc.contributor.committeemember | Chan, Erwin | en_US
dc.contributor.committeemember | Archangeli, Diana B. | en_US
dc.contributor.committeemember | Fasel, Ian R. | en_US