Inverse Parametric Alignment for Accurate Biological Sequence Comparison

Persistent Link:
http://hdl.handle.net/10150/193664
Title:
Inverse Parametric Alignment for Accurate Biological Sequence Comparison
Author:
Kim, Eagu
Issue Date:
2008
Publisher:
The University of Arizona.
Rights:
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract:
For as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. In practice, substitution scores are usually chosen by convention, and gap penalties are often found by trial and error. In contrast, a rigorous way to determine parameter values that are appropriate for aligning biological sequences is by solving the problem of Inverse Parametric Sequence Alignment. Given examples of biologically correct reference alignments, this is the problem of finding parameter values that make the examples score as close as possible to optimal alignments of their sequences. The reference alignments that are currently available contain regions where the alignment is not specified, which leads to a version of the problem with partial examples.In this dissertation, we develop a new polynomial-time algorithm for Inverse Parametric Sequence Alignment that is simple to implement, fast in practice, and can learn hundreds of parameters simultaneously from hundreds of examples. Computational results with partial examples show that best possible values for all 212 parameters of the standard alignment scoring model for protein sequences can be computed from 200 examples in 4 hours of computation on a standard desktop machine. We also consider a new scoring model with a small number of additional parameters that incorporates predicted secondary structure for the protein sequences. By learning parameter values for this new secondary-structure-based model, we can improve on the alignment accuracy of the standard model by as much as 15% for sequences with less than 25% identity.
Type:
text; Electronic Dissertation
Keywords:
inverse alignment; sequence analysis; computational biology; bioinformatics; algorithm; combinatorial optimization
Degree Name:
PhD
Degree Level:
doctoral
Degree Program:
Computer Science; Graduate College
Degree Grantor:
University of Arizona
Advisor:
Kececioglu, John D.
Committee Chair:
Kececioglu, John D.

Full metadata record

DC FieldValue Language
dc.language.isoENen_US
dc.titleInverse Parametric Alignment for Accurate Biological Sequence Comparisonen_US
dc.creatorKim, Eaguen_US
dc.contributor.authorKim, Eaguen_US
dc.date.issued2008en_US
dc.publisherThe University of Arizona.en_US
dc.rightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.en_US
dc.description.abstractFor as long as biologists have been computing alignments of sequences, the question of what values to use for scoring substitutions and gaps has persisted. In practice, substitution scores are usually chosen by convention, and gap penalties are often found by trial and error. In contrast, a rigorous way to determine parameter values that are appropriate for aligning biological sequences is by solving the problem of Inverse Parametric Sequence Alignment. Given examples of biologically correct reference alignments, this is the problem of finding parameter values that make the examples score as close as possible to optimal alignments of their sequences. The reference alignments that are currently available contain regions where the alignment is not specified, which leads to a version of the problem with partial examples.In this dissertation, we develop a new polynomial-time algorithm for Inverse Parametric Sequence Alignment that is simple to implement, fast in practice, and can learn hundreds of parameters simultaneously from hundreds of examples. Computational results with partial examples show that best possible values for all 212 parameters of the standard alignment scoring model for protein sequences can be computed from 200 examples in 4 hours of computation on a standard desktop machine. We also consider a new scoring model with a small number of additional parameters that incorporates predicted secondary structure for the protein sequences. By learning parameter values for this new secondary-structure-based model, we can improve on the alignment accuracy of the standard model by as much as 15% for sequences with less than 25% identity.en_US
dc.typetexten_US
dc.typeElectronic Dissertationen_US
dc.subjectinverse alignmenten_US
dc.subjectsequence analysisen_US
dc.subjectcomputational biologyen_US
dc.subjectbioinformaticsen_US
dc.subjectalgorithmen_US
dc.subjectcombinatorial optimizationen_US
thesis.degree.namePhDen_US
thesis.degree.leveldoctoralen_US
thesis.degree.disciplineComputer Scienceen_US
thesis.degree.disciplineGraduate Collegeen_US
thesis.degree.grantorUniversity of Arizonaen_US
dc.contributor.advisorKececioglu, John D.en_US
dc.contributor.chairKececioglu, John D.en_US
dc.contributor.committeememberDowney, Peteren_US
dc.contributor.committeememberKobourov, Stephenen_US
dc.contributor.committeememberIndik, Roberten_US
dc.identifier.proquest2901en_US
dc.identifier.oclc659749547en_US
All Items in UA Campus Repository are protected by copyright, with all rights reserved, unless otherwise indicated.