Exact and approximation algorithms for DNA sequence reconstruction.

Persistent Link:
http://hdl.handle.net/10150/185673
Title:
Exact and approximation algorithms for DNA sequence reconstruction.
Author:
Kececioglu, John Dimitri.
Issue Date:
1991
Publisher:
The University of Arizona.
Rights:
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract:
The DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 ≤ ε < 1, find a shortest sequence S such that every fragment F ∈ F, or its reverse complement, matches a substring of S with at most ε|F| errors. Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2ε/(1-ε) and at most a factor 1/1-ε longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1- ε)α under the overlap measure. We construct the overlap graph for (1) in O(εN²) time given fragments of total length N at error rate ε. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K ≤ 2ᴱ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V)log V) time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.
Type:
text; Dissertation-Reproduction (electronic)
Keywords:
DNA -- Computer programs.
Degree Name:
Ph.D.
Degree Level:
doctoral
Degree Program:
Computer Science; Graduate College
Degree Grantor:
University of Arizona
Advisor:
Myers, Eugene W.

Full metadata record

DC FieldValue Language
dc.language.isoenen_US
dc.titleExact and approximation algorithms for DNA sequence reconstruction.en_US
dc.creatorKececioglu, John Dimitri.en_US
dc.contributor.authorKececioglu, John Dimitri.en_US
dc.date.issued1991en_US
dc.publisherThe University of Arizona.en_US
dc.rightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.en_US
dc.description.abstractThe DNA sequence in every human being is a text of three billion characters from a four letter alphabet; determining this sequence is a major project in molecular biology. The fundamental task biologists face is to reconstruct a long sequence given short fragments from unknown locations. These fragments contain errors, and may represent the sequence on one strand of the double-helix, or the reverse complement sequence on the other strand. The Sequence Reconstruction Problem is, given a collection F of fragment sequences and an error rate 0 ≤ ε < 1, find a shortest sequence S such that every fragment F ∈ F, or its reverse complement, matches a substring of S with at most ε|F| errors. Sequence Reconstruction is NP-complete. We decompose the problem into (1) constructing a graph of approximate overlaps between pairs of fragments, (2) selecting a set of overlaps of maximum total weight that induce a consistent layout of the fragments, (3) merging the overlaps into a multiple sequence alignment and voting on a consensus. A solution to (1) through (3) yields a reconstructed sequence feasible at error rate 2ε/(1-ε) and at most a factor 1/1-ε longer than the shortest reconstruction, given some assumptions on fragment error. We define a measure of the overlap in a reconstruction, show that maximizing the overlap minimizes the length, and that approximating (2) within a factor of α approximates Sequence Reconstruction within a factor of (1- ε)α under the overlap measure. We construct the overlap graph for (1) in O(εN²) time given fragments of total length N at error rate ε. We develop two exact and two approximation algorithms for (2). Our best exact algorithm computes an optimal layout for a graph of E overlaps and V fragments in O(K(E + V log V)) time, where K ≤ 2ᴱ is the size of the branch-and-bound search tree. Our best approximation algorithm computes a layout with overlap at least 1/2 the maximum in O(V(E + V log V)log V) time. This is the first treatment of Sequence Reconstruction with inexact data and unknown complementarity.en_US
dc.typetexten_US
dc.typeDissertation-Reproduction (electronic)en_US
dc.subjectDNA -- Computer programs.en_US
thesis.degree.namePh.D.en_US
thesis.degree.leveldoctoralen_US
thesis.degree.disciplineComputer Scienceen_US
thesis.degree.disciplineGraduate Collegeen_US
thesis.degree.grantorUniversity of Arizonaen_US
dc.contributor.advisorMyers, Eugene W.en_US
dc.contributor.committeememberDowney, Peteren_US
dc.contributor.committeememberManber, Udien_US
dc.identifier.proquest9210281en_US
dc.identifier.oclc704395410en_US
All Items in UA Campus Repository are protected by copyright, with all rights reserved, unless otherwise indicated.