Persistent Link:
http://hdl.handle.net/10150/612932
Title:
Parameter Advising for Multiple Sequence Alignment
Author:
DeBlasio, Daniel Frank
Issue Date:
2016
Publisher:
The University of Arizona.
Rights:
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract:
The problem of aligning multiple protein sequences is essential to many biological analyses, but most standard formulations of the problem are NP-complete. Due to both the difficulty of the problem and its practical importance, there are many heuristic multiple sequence aligners that a researcher has at their disposal. A basic issue that frequently arises is that each of these alignment tools has a multitude of parameters that must be set, and which greatly affect the quality of the alignment produced. Most users rely on the default parameter setting that comes with the aligner, which is optimal on average, but can produce a low-quality alignment for the given inputs. This dissertation develops an approach called parameter advising to find a parameter setting that produces a high-quality alignment for each given input. A parameter advisor aligns the input sequences for each choice in a collection of parameter settings, and then selects the best alignment from the resulting alignments produced. A parameter advisor has two major components: (i) an advisor set of parameter choices that are given to the aligner, and (ii) an accuracy estimator that is used to rank alignments produced by the aligner. Alignment accuracy is measured with respect to a known reference alignment; in practice, a reference alignment is not available, and we can only estimate accuracy. We develop a new accuracy estimator that we call Facet (short for "feature-based accuracy estimator") that computes an accuracy estimate as a linear combination of efficiently-computable feature functions, whose coefficients are learned by solving a large-scale linear programming problem. We also develop an efficient approximation algorithm for finding an advisor set of a given cardinality for a fixed estimator, whose cardinality should ideally be small, as the aligner is invoked for each parameter choice in the set.
Using Facet for parameter advising boosts advising accuracy by almost 20% beyond using a single default parameter choice for the hardest-to-align benchmarks. This dissertation further applies parameter advising in two ways: (i) to ensemble alignment, which uses the advising process on a collection of aligners to choose both the aligner and its parameter settings, and (ii) to adaptive local realignment, which can align different regions of the input sequences with distinct parameter choices to conform to mutation rates as they vary across the lengths of the sequences.
Type:
text; Electronic Dissertation
Keywords:
Computer Science
Degree Name:
Ph.D.
Degree Level:
doctoral
Degree Program:
Graduate College; Computer Science
Degree Grantor:
University of Arizona
Advisor:
Kececioglu, John
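
Illustrative sketch:
The advising procedure described in the abstract (run the aligner once per parameter choice in the advisor set, score each resulting alignment with an accuracy estimator, keep the highest-scoring one) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the toy aligner, the two feature functions, the coefficients, and all function names are hypothetical stand-ins. Facet's actual features and its learned coefficients are not given in this record.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Alignment = List[str]        # rows of equal length; '-' marks a gap
Params = Dict[str, float]    # hypothetical parameter setting
Feature = Callable[[Alignment], float]


def gap_free_fraction(aln: Alignment) -> float:
    """Toy feature (assumption): fraction of columns with no gap."""
    cols = list(zip(*aln))
    return sum('-' not in c for c in cols) / len(cols) if cols else 0.0


def agreement_fraction(aln: Alignment) -> float:
    """Toy feature (assumption): fraction of columns whose non-gap
    residues are all identical."""
    cols = list(zip(*aln))
    if not cols:
        return 0.0
    return sum(len({ch for ch in c if ch != '-'}) == 1 for c in cols) / len(cols)


def make_linear_estimator(features: Sequence[Feature],
                          coefficients: Sequence[float]) -> Feature:
    """Estimator as a linear combination of feature functions.
    The coefficients here are supplied by hand, not learned."""
    if len(features) != len(coefficients):
        raise ValueError("need one coefficient per feature")

    def estimate(aln: Alignment) -> float:
        return sum(c * f(aln) for c, f in zip(coefficients, features))

    return estimate


def advise(sequences: Sequence[str],
           aligner: Callable[[Sequence[str], Params], Alignment],
           advisor_set: Sequence[Params],
           estimator: Feature) -> Tuple[Params, Alignment, float]:
    """Align once per parameter choice and return the choice whose
    alignment has the highest estimated accuracy."""
    if not advisor_set:
        raise ValueError("advisor set is empty")
    scored = []
    for params in advisor_set:
        aln = aligner(sequences, params)
        scored.append((params, aln, estimator(aln)))
    return max(scored, key=lambda item: item[2])


def toy_aligner(sequences: Sequence[str], params: Params) -> Alignment:
    """Stand-in for a real aligner (assumption): pads every sequence to
    the longest length, on the left if pad_left > 0, else on the right."""
    width = max(len(s) for s in sequences)
    if params["pad_left"] > 0:
        return [s.rjust(width, '-') for s in sequences]
    return [s.ljust(width, '-') for s in sequences]


estimator = make_linear_estimator(
    [gap_free_fraction, agreement_fraction], [0.5, 0.5])
best_params, best_aln, best_score = advise(
    ["ACGT", "CGT"],
    toy_aligner,
    [{"pad_left": 0.0}, {"pad_left": 1.0}],
    estimator,
)
print(best_params, best_aln, best_score)
```

Because the aligner is invoked once per entry in the advisor set, the running time of `advise` grows linearly with the size of that set, which is why the abstract stresses keeping the set small.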

Full metadata record

DC Field | Value | Language
dc.language.iso | en_US | en
dc.title | Parameter Advising for Multiple Sequence Alignment | en_US
dc.creator | DeBlasio, Daniel Frank | en
dc.contributor.author | DeBlasio, Daniel Frank | en
dc.date.issued | 2016 | -
dc.publisher | The University of Arizona. | en
dc.rights | Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. | en
dc.description.abstract | The problem of aligning multiple protein sequences is essential to many biological analyses, but most standard formulations of the problem are NP-complete. Due to both the difficulty of the problem and its practical importance, there are many heuristic multiple sequence aligners that a researcher has at their disposal. A basic issue that frequently arises is that each of these alignment tools has a multitude of parameters that must be set, and which greatly affect the quality of the alignment produced. Most users rely on the default parameter setting that comes with the aligner, which is optimal on average, but can produce a low-quality alignment for the given inputs. This dissertation develops an approach called parameter advising to find a parameter setting that produces a high-quality alignment for each given input. A parameter advisor aligns the input sequences for each choice in a collection of parameter settings, and then selects the best alignment from the resulting alignments produced. A parameter advisor has two major components: (i) an advisor set of parameter choices that are given to the aligner, and (ii) an accuracy estimator that is used to rank alignments produced by the aligner. Alignment accuracy is measured with respect to a known reference alignment; in practice, a reference alignment is not available, and we can only estimate accuracy. We develop a new accuracy estimator that we call Facet (short for "feature-based accuracy estimator") that computes an accuracy estimate as a linear combination of efficiently-computable feature functions, whose coefficients are learned by solving a large-scale linear programming problem. We also develop an efficient approximation algorithm for finding an advisor set of a given cardinality for a fixed estimator, whose cardinality should ideally be small, as the aligner is invoked for each parameter choice in the set. Using Facet for parameter advising boosts advising accuracy by almost 20% beyond using a single default parameter choice for the hardest-to-align benchmarks. This dissertation further applies parameter advising in two ways: (i) to ensemble alignment, which uses the advising process on a collection of aligners to choose both the aligner and its parameter settings, and (ii) to adaptive local realignment, which can align different regions of the input sequences with distinct parameter choices to conform to mutation rates as they vary across the lengths of the sequences. | en
dc.type | text | en
dc.type | Electronic Dissertation | en
dc.subject | Computer Science | en
thesis.degree.name | Ph.D. | en
thesis.degree.level | doctoral | en
thesis.degree.discipline | Graduate College | en
thesis.degree.discipline | Computer Science | en
thesis.degree.grantor | University of Arizona | en
dc.contributor.advisor | Kececioglu, John | en
dc.contributor.committeemember | Sanderson, Michael | en
dc.contributor.committeemember | Kobourov, Stephen | en
dc.contributor.committeemember | Efrat, Alon | en
dc.contributor.committeemember | Kececioglu, John | en