Kun-Mao
PS. I won't pass Yang-Ho's draft around because it's
still under review. Yang-Ho, welcome back!
Talk #1: December 12, 2008 (CSIE seminar time)
Assembly for Double-Ended Short-Read Sequencing Technologies
Steven Skiena
Department of Computer Science
State University of New York
Stony Brook, NY 11794-4400 USA
http://www.cs.sunysb.edu/~skiena
Next-generation sequencing technologies developed by Solexa/Illumina,
Agencourt/ABI, and Helicos Biosciences yield sequencing reads which
are dramatically shorter (20-40 bases) but vastly cheaper than
those produced by the previous generation of sequencing machines.
We study the space of read length, sequencing error rate, and coverage
that lies well outside conventional assumptions to determine the
technological/economic parameters where de novo sequencing
will be achievable with these new technologies.
We prove that genome assembly on bacterial and human sequences is
possible using astonishingly short reads, given sufficiently high coverage.
In particular, we demonstrate that we can assemble bacterial genomes using
data from ABI's recently launched SOLiD sequencer.
(Joint work with J. Chen and S. Hossain.)
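For readers who want a feel for the coverage numbers involved, here is a minimal sketch of the standard idealized coverage calculation (reads placed uniformly at random, no sequencing errors, no repeats). It is not the analysis presented in the talk, and the genome size and read length in the usage note are illustrative assumptions only.

```python
import math


def coverage(num_reads, read_length, genome_length):
    """Average number of reads covering each base (often written c = N*L/G)."""
    return num_reads * read_length / genome_length


def expected_uncovered_fraction(cov):
    """Under the idealized uniform model, a base is missed with probability e^-c."""
    return math.exp(-cov)


def reads_needed(target_coverage, read_length, genome_length):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    # Hypothetical example: a 5 Mb bacterial genome sequenced with 25-base reads.
    n = reads_needed(50, 25, 5_000_000)
    print(n, coverage(n, 25, 5_000_000), expected_uncovered_fraction(50))
```

With 25-base reads, reaching 50-fold average coverage of a 5 Mb genome takes ten million reads under this model. Whether such coverage suffices for assembly in the presence of errors and repeats is the question the talk addresses.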
Biography: Steven Skiena is Professor of Computer Science at SUNY Stony
Brook. His research interests include the design of graph, string, and
geometric algorithms and their applications (particularly to biology).
He is the author of four books, including "The Algorithm Design Manual"
and "Calculated Bets: Computers, Gambling, and Mathematical Modeling to
Win". He is a recipient of the ONR Young Investigator Award and the IEEE
Computer Science and Engineering Undergraduate Teaching Award.
Talk #2: 1:20pm December 10, 2008; Room R107
ReSEQ: Mapping Reads with Statistical Evaluation of
Quality Scores for Genome Resequencing
Yangho Chen
Program in Computational Biology and Bioinformatics
University of Southern California
We have developed ReSEQ, a program that efficiently maps
millions of short reads from a Solexa high-throughput sequencer onto a
reference genome or transcriptome. ReSEQ iteratively maps, weights, and
calls SNPs for reads to maximize the accuracy in estimation of the target
genome or the expression levels. The mapping algorithm in ReSEQ uses optimal
single spaced seeds and presents an integer programming method to generate
paired seeds. The spaced seeds allow efficient reporting of all alignments
within three substitutions or insertions/deletions of length less than three
base pairs. Compared to other existing methods, the single spaced seed
increases speed and sensitivity while requiring only one-third of the memory
used by existing programs. This design makes it possible to load the hash
table for the whole human genome or transcriptome into memory on a server or
desktop, respectively. For each alignment, ReSEQ calculates statistical
significance using a dynamic programming algorithm on a Markov model learned
from the background sequences. ReSEQ estimates the rates at which the
machines create sequencing errors with an EM algorithm. Reads which map
significantly to multiple locations are weighted according to the
probability that each location is responsible for the read, balancing the
goal of high coverage against the need for statistical rigor. ReSEQ uses a
likelihood ratio test based on quality scores to distinguish sequencing
errors from SNPs, iteratively re-mapping reads to provide the best estimate
for the target genome from which reads were sequenced. Test results show
that iterative re-mapping and re-estimation of the target genome sequence
significantly increase the number of mapped reads and called SNPs.
(Joint work with Tade Souaiaia and Ting Chen.)
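As a rough illustration of the spaced-seed idea mentioned in the abstract, here is a minimal sketch of hashing a reference with a single spaced seed and then verifying candidate locations. The seed pattern, function names, and mismatch threshold are hypothetical and chosen for illustration; this is not ReSEQ's optimal seed or its implementation, it handles substitutions only, and it makes no guarantee of reporting every alignment within the threshold.

```python
from collections import defaultdict

# Hypothetical seed: '1' positions must match, '0' positions are ignored.
SEED = "1101011"


def seed_key(s, offset, seed=SEED):
    """Characters of s under the '1' positions of the seed placed at offset."""
    return "".join(s[offset + i] for i, c in enumerate(seed) if c == "1")


def build_index(reference, seed=SEED):
    """Hash table from seed key to every reference position that produces it."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(seed) + 1):
        index[seed_key(reference, pos, seed)].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, max_mismatches=3, seed=SEED):
    """Return sorted reference start positions whose window is within
    max_mismatches substitutions of the read and shares a seed hit with it."""
    hits = set()
    for offset in range(len(read) - len(seed) + 1):
        for pos in index.get(seed_key(read, offset, seed), ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # candidate alignment would run off the reference
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTTGCATGCCGATAGCTAGGCTA"
    index = build_index(reference)
    # reference[5:17] with one substitution at read position 3
    print(map_read("GCAAGCCGATAG", reference, index))
```

Because the seed ignores its '0' positions, a read can still hit the index when a substitution falls under a don't-care position, which is what gives spaced seeds better sensitivity than contiguous seeds of the same weight.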
Biography: Yangho Chen is currently a Ph.D. candidate at USC. He
received the B.S. degree in computer science and information engineering
from National Taiwan University in 2003. His research interests include
algorithms and bioinformatics.