Course: Algorithms for Biological Sequence Analysis
Fall semester, 2019
9:10 – 12:10 Tuesday, R107@CSIE Building
3 credits

Website: http://www.csie.ntu.edu.tw/~kmchao/seq19fall

Instructor: Kun-Mao Chao (趙坤茂)

Teaching assistant:  百榆 r07922177 followed by @ntu.edu.tw
[TA's office hours: by appointment; Venue: R432]

 

Prerequisites: Some basic knowledge on algorithms is required. Background in bioinformatics and computational biology is welcome but not required for taking this course.

 

Classmates: I  II  III  IV  V  VI

 

Coursework:

    Homework assignments and Class participation (10%)

        #1: Handout: September 10, 2019; Due: September 27, 2019 (See below)

    Two midterm exams (70%; 35% each):

        Midterm #1: Oct. 22, 2019
       
Midterm #2: Dec. 3, 2019

    Oral presentation of selected papers/projects (20%)

        [Dec. 10 for preparation],
        Dec. 17, Dec. 24, Dec. 31.

 

Supporting materials:

    About This Class     Survey [2019/9/10]

    Interesting Sequences [2019/9/10]

    A Whirlwind Tour of Bioinformatics [2019/9/10]

    Basic Algorithmic Strategies [2019/9/17]
 

              -- Basic Algorithmic Strategies [2019/9/17]

              -- LIS example [2019/9/17]

              -- LCS_example  [2019/9/17]

 

    Sequence Alignment [2019/9/17, 2019/9/24, 2019/10/1]

-- Global Alignment [2019/9/17, 2019/9/24]

-- A Note on Computing Heaviest Segments [Pages 7-8] [2019/9/24]

-- Local Alignment [2019/9/24]

-- Various Scoring Schemes [2019/9/24]

-- An affine-gap-penalty example [2019/9/24]

-- Example 2; Solution [2019/9/24]

-- Scoring scheme examples [2019/9/24]

-- Needleman, Saul B.; and Wunsch, Christian D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443–53.

-- Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197.

-- Gotoh, Osamu: An improved algorithm for matching biological sequences. In: Journal of Molecular Biology. 162, 1982, S. 705-708 (PDF, 206 KB).

    Space-Saving Strategies [2019/10/1]

    Suboptimal Alignments [2019/10/8]

-- Space-Saving Strategies (pdf) [2019/10/1, 2019/10/8]

-- A Note for Method 7@Suboptimal Alignments [2019/10/8]

-- Eugene Myers and Webb Miller, "Optimal Alignments in Linear Space," CABIOS (Bioinformatics) 9: 169-176, 1988. [pdf]

-- Chao, K. -M., Pearson, W. R. and Miller, W., 1992, Aligning Two Sequences within a Specified Diagonal Band, Computer Applications in the Biosciences (CABIOS, now Bioinformatics), 8: 481-487.

-- Chao, K.-M., Hardison, R. C. and Miller, W., 1993, Constrained Sequence Alignment, Bulletin of Mathematical Biology, 55: 503-524.

-- Chao, K.-M., Hardison, R. C. and Miller, W., 1993, Locating Well-Conserved Regions within a Pairwise Alignment, Computer Applications in the Biosciences (CABIOS, now Bioinformatics), 9: 387-396.

-- Chao, K. -M., 1994, Computing All Suboptimal Alignments in Linear Space, Combinatorial Pattern Matching '94, Lecture Notes in Computer Science 807, 31-42, California, USA.

-- Chao, K. -M., Hardison R. C. and Miller, W., 1994, Recent Developments in Linear-Space Alignment Methods: a Survey, Journal of Computational Biology, 1: 271-291.

-- Chao, K.-M. and Miller, W., 1995, Linear-Space Algorithms that Build Local Alignments from Fragments, Algorithmica, 13: 106-134.

    Multiple Sequence Alignment [2019/10/15]

    -- Multiple Sequence Alignment [2019/10/15]

Homology Search [2019/10/29, 2019/11/5]

-- Homology Search Tools (pdf) [2019/10/29, 2019/11/5]

   Burrows-Wheeler Transform and FM-Index [2019/10/29]

 

   Genome Reconstruction by Phillip E. C. Compeau and Pavel A. Pevzner [2019/11/12]

    -- ppt (You may skip pages 125-175 & 301-340.) [2019/11/12]

    -- Lectures by Phillip E. C. Compeau

    -- B(2,3) [2019/11/12]

    -- Eulerian Tours by Christos H. Papadimitriou & Umesh Vazirani [2019/11/12]
(Exchange 10 and 01 of the left figure on Page 4)

    -- An example for fragment assembly as an Eulerian cycle problem [2019/11/12]

    Pattern Identification in a Haplotype Block (ppt, pdf) [2019/11/19]
 

              -- An example [2019/11/19]

 

    Haplotype Inference (ppt, pdf) [2019/11/19, 2019/11/26]
    (Read the problem formulation on pp. 1262-1263 of the pdf file)

    -- Some other ways of formulation [2019/11/26]

    Heaviest Segments [updated on 2019/11/26]

   -- A Note on Computing Heaviest Segments [2019/11/26]

    RMSQ [updated on 2019/11/26]

   -- Chen, K.-Y. and Chao, K.-M.,  2007, “On the Range Maximum-Sum Segment Query Problem,” Discrete Applied Mathematics, 155: 2043-2052. [2019/11/26]

Homework assignments:

 

#1: Handout: September 10, 2019; Due: September 27, 2019

    Send TA a PowerPoint page including: (Subject: [HW1] Your student ID number)

   A sample solution

   More sample solutions: 2007  2008  2009  2011  2015  2016  2017  2018

 

Class presentations:

1.      The expected number of team members: 1~6;

2.      Each member is required to present in turn [about 10 minutes each];

3.      Revised slides should be sent to me within one week after the presentation.

4.      Questions in class are always welcome.

 

[Dec. 10 for preparation, no class]

 

Dec. 17

Deep generative models of genetic variation capture the effects of mutations
Adam J. Riesselman, John B. Ingraham & Debora S. Marks
Nature Methods, 15, 816–822 (2018).
黃筑葭  Andreas Pirchner  Eloi Dussy Lachaud  林溥博

 

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Langmead B, Trapnell C, Pop M, Salzberg SL.
Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25.
陳百榆  王俊中

 

Linear work suffix array construction

JUHA KARKKAINEN,  PETER SANDERS,  STEFAN BURKHARDT

JACM, Volume 53 Issue 6, November 2006.

     徐展鴻

 

The streaming k-mismatch problem.
Raphaël Clifford, Tomasz Kociumaka, Ely Porat
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1106-1125.
張顥霆  林映廷  廖名淳

 

Dec. 24

Approximating LCS in Linear Time: Beating the √n Barrier.
MohammadTaghi Hajiaghayi, Masoud Seddighin, Saeed Seddighin, Xiaorui Sun
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1181-1200.
曾奕青  蔡昀達  張凱捷  謝宏祺 (11/20+)

 

Massively Parallel Approximation Algorithms for Edit Distance and Longest Common Subsequence.
MohammadTaghi Hajiaghayi, Saeed Seddighin, Xiaorui Sun
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1654-1672.
紀伯翰  林義聖  黃子賢  楊書文

 

Efficiently Approximating Edit Distance Between Pseudorandom Strings.
William Kuszmaul
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1165-1180.
洪靖秦  陳柔安  廖克允  蕭大哲  魏孜昀  傅家靖

 

Dec. 31

MSC: a metagenomic sequence classification algorithm
Subrata Saha, Jethro Johnson, Soumitra Pal, George M Weinstock, Sanguthevar Rajasekaran
Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2932–2940.
楊宗儒  黃志揚

 

CoMSA: Compression of protein multiple sequence alignment files.
Deorowicz S, Walczyszyn J, Debudaj-Grabysz A.
Bioinformatics. 2018 Jul 13. doi: 10.1093/bioinformatics/bty619. [Epub ahead of print]
林政豪  張立暐

 

Minimap2: pairwise alignment for nucleotide sequences
Heng Li
Bioinformatics, Volume 34, Issue 18, 15 September 2018, Pages 3094–3100.
王佑安  賴達  陳昇  洪浩翔

 

High-quality genome sequences of uncultured microbes by assembly of read clouds (11/13+)
Alex Bishara, Eli L Moss, Mikhail Kolmogorov, Alma E Parada, Ziming Weng, Arend Sidow, Anne E Dekas, Serafim Batzoglou & Ami S Bhatt
Nature Biotechnology, Published: 15 October, 2018.
譚雋飛

 

In silico read normalization using set multi-cover optimization (12/22+)
Dilip A Durai; Marcel H Schulz
Bioinformatics, Volume 34, Issue 19, 1 October 2018, Pages 3273–3280.
徐衍新

 

Selected papers for presentation:

(You may access these selected articles using computers with NTU IP addresses.)

  1. Approximating LCS in Linear Time: Beating the √n Barrier.
    MohammadTaghi Hajiaghayi, Masoud Seddighin, Saeed Seddighin, Xiaorui Sun
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1181-1200.

  2. Massively Parallel Approximation Algorithms for Edit Distance and Longest Common Subsequence.
    MohammadTaghi Hajiaghayi, Saeed Seddighin, Xiaorui Sun
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1654-1672.

  3. Efficiently Approximating Edit Distance Between Pseudorandom Strings.
    William Kuszmaul
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1165-1180.

  4. The streaming k-mismatch problem.
    Raphaël Clifford, Tomasz Kociumaka, Ely Porat
    Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, 1106-1125.

  5. CoMSA: Compression of protein multiple sequence alignment files.
    Deorowicz S, Walczyszyn J, Debudaj-Grabysz A.
    Bioinformatics. 2018 Jul 13. doi: 10.1093/bioinformatics/bty619. [Epub ahead of print]

  6. Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli
    Guillaume Cambray, Joao C Guimaraes & Adam Paul Arkin
    Nature Biotechnology, volume 36, pages 1005–1015 (2018).

  7. High-quality genome sequences of uncultured microbes by assembly of read clouds
    Alex Bishara, Eli L Moss, Mikhail Kolmogorov, Alma E Parada, Ziming Weng, Arend Sidow, Anne E Dekas, Serafim Batzoglou & Ami S Bhatt
    Nature Biotechnology, Published: 15 October, 2018.

  8. CRISPR knockout screening identifies combinatorial drug targets in pancreatic cancer and models cellular drug response
    Karol Szlachta, Cem Kuscu, Turan Tufan, Sara J. Adair, Stephen Shang, Alex D. Michaels, Matthew G. Mullen, Natasha Lopes Fischer, Jiekun Yang, Limin Liu, Prasad Trivedi, Edward B. Stelow, P. Todd Stukenberg, J. Thomas Parsons, Todd W. Bauer & Mazhar Adli
    Nature Communications, 9, Article number: 4275 (2018).

  9. Deep generative models of genetic variation capture the effects of mutations
    Adam J. Riesselman, John B. Ingraham & Debora S. Marks
    Nature Methods, 15, 816–822 (2018).

  10. In silico read normalization using set multi-cover optimization
    Dilip A Durai; Marcel H Schulz
    Bioinformatics, Volume 34, Issue 19, 1 October 2018, Pages 3273–3280.

  11. Minimap2: pairwise alignment for nucleotide sequences
    Heng Li
    Bioinformatics, Volume 34, Issue 18, 15 September 2018, Pages 3094–3100.

  12. MSC: a metagenomic sequence classification algorithm
    Subrata Saha, Jethro Johnson, Soumitra Pal, George M Weinstock, Sanguthevar Rajasekaran
    Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2932–2940.

  13. Improved indel detection in DNA and RNA via realignment with ABRA2 
    Lisle E Mose, Charles M Perou, Joel S Parker
    Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 2966–2973.

  14. PopCluster: an algorithm to identify genetic variants with ethnicity-dependent effects 
    Anastasia Gurinovich, Harold Bae, John J Farrell, Stacy L Andersen, Stefano Monti ...
    Bioinformatics, Volume 35, Issue 17, 1 September 2019, Pages 3046–3054.

  15. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
    Langmead B, Trapnell C, Pop M, Salzberg SL.
    Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25.

  16. Fast gapped-read alignment with Bowtie 2
    Ben Langmead and Steven L Salzberg
    Nat Methods. 2012 Mar 4; 9(4): 357–359.

 

References (Recommended, but not required):
1. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997)
2. Biological Sequence Analysis, by Richard Durbin et al. (1998)
3. Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000)
4. An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel Pevzner (2004)
5. Sequence Comparison: Theory and Methods,” by Chao, K.-M. and Zhang, L. (2009)
6. Bioinformatics for Biologists, edited by Pavel Pevzner and Ron Shamir (2011)
7. Bioinformatics Algorithms: an Active Learning Approach, Vol. I, II by Phillip E. C. Compeau and Pavel A. Pevzner (August 2015)

[http://rosalind.info/problems/list-view/?location=bioinformatics-textbook-track]

8. Related journal and conference papers
 
Bibliography links: