Course: Algorithms for Biological Sequence Analysis
Fall semester, 2022
9:10 – 12:10 Tuesday, R105@CSIE Building
3 credits

Website: http://www.csie.ntu.edu.tw/~kmchao/seq22fall

Instructor: Kun-Mao Chao (趙坤茂)

Teaching assistant: 邱詠智 r10922064 followed by @ntu.edu.tw
[TA's office hours: by appointment; Venue: R432]

 

Prerequisites: Some basic knowledge on algorithms is required. Background in bioinformatics and computational biology is welcome but not required for taking this course.

 

Classmates: I  II  III  IV  V  VI

 

Coursework:

    Homework assignments and Class participation (10%)

        #1: Handout: September 6, 2022; Due: September 23, 2022 (See below)

    Two midterm exams (70%; 35% each):

        Midterm #1: Oct. 11, 2022 (Some tips for preparation were announced via NTU COOL on Oct. 6, 2022.)

            Midterm #1's Score Histogram [2022/10/25]
        Midterm #2: Nov. 22, 2022 (Some tips for preparation were announced via NTU COOL on Nov. 14, 2022.)

            Midterm #2's Score Histogram [2022/12/6]

    Oral presentation of selected papers/projects (20%)

        Nov. 29, Dec. 6, and Dec. 13 (Schedule given on 2022/10/25)

 

Supporting materials:

    About This Class     Survey [2022/9/6]

    Interesting Sequences [2022/9/6]

    A Whirlwind Tour of Bioinformatics [2022/9/6]

    Basic Algorithmic Strategies [2022/9/13]
 

              -- Basic Algorithmic Strategies [2022/9/13]

              -- Vertex_Cover_Greedy_Fail [2022/9/13]

              -- LIS example [2022/9/13]

              -- LCS_example [2022/9/13]

 

    Sequence Alignment [2022/9/13, 2022/9/20]

-- Global Alignment [2022/9/13, 2022/9/20]

-- A Note on Computing Heaviest Segments [2022/9/20]

-- Local Alignment [2022/9/20]

-- Various Scoring Schemes [2022/9/20]

-- An affine-gap-penalty example [2022/9/20]

-- Example 2; Solution [2022/9/20]

-- Scoring scheme examples [2022/9/20]

-- Needleman, Saul B.; and Wunsch, Christian D. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443–53.

-- Smith, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195–197.

-- Gotoh, Osamu: An improved algorithm for matching biological sequences. In: Journal of Molecular Biology. 162, 1982, S. 705-708 (PDF, 206 KB).

-- Pairwise Sequence Alignment (EMBL-EBI) [2022/9/27]

-- Pairwise Align DNA (Sequence Manipulation Suite) example [2022/9/27]

    Space-Saving Strategies [2022/9/27]

    Suboptimal Alignments [2022/9/27, 2022/10/4]

-- Space-Saving Strategies  [2022/9/27, 2022/10/4]

-- A Note for Method 7@Suboptimal Alignments [2022/9/27, 2022/10/4]

-- More on Method 7@Suboptimal Alignments [2022/9/27, 2022/10/4]

-- Eugene Myers and Webb Miller, "Optimal Alignments in Linear Space," CABIOS (Bioinformatics) 9: 169-176, 1988. [pdf]

-- Chao, K. -M., Pearson, W. R. and Miller, W., 1992, Aligning Two Sequences within a Specified Diagonal Band, Computer Applications in the Biosciences (CABIOS, now Bioinformatics), 8: 481-487.

-- Chao, K.-M., Hardison, R. C. and Miller, W., 1993, Constrained Sequence Alignment, Bulletin of Mathematical Biology, 55: 503-524.

-- Chao, K.-M., Hardison, R. C. and Miller, W., 1993, Locating Well-Conserved Regions within a Pairwise Alignment, Computer Applications in the Biosciences (CABIOS, now Bioinformatics), 9: 387-396.

-- Chao, K. -M., 1994, Computing All Suboptimal Alignments in Linear Space, Combinatorial Pattern Matching '94, Lecture Notes in Computer Science 807, 31-42, California, USA.

-- Chao, K. -M., Hardison R. C. and Miller, W., 1994, Recent Developments in Linear-Space Alignment Methods: a Survey, Journal of Computational Biology, 1: 271-291.

-- Chao, K.-M. and Miller, W., 1995, Linear-Space Algorithms that Build Local Alignments from Fragments, Algorithmica, 13: 106-134.

    Multiple Sequence Alignment [2022/10/4]

    -- Multiple Sequence Alignment [2022/10/4]

Homology Search [2022/10/18; You may skip pages 15-17 & 39-43.]

-- Homology Search Tools (pdf) [2022/10/18]

   Burrows-Wheeler Transform and FM-Index [2022/10/18, 2022/10/25]

 

              -- BWT_TENETENET [2022/10/18, 2022/10/25]

 

   Genome Reconstruction by Phillip E. C. Compeau and Pavel A. Pevzner [2022/10/25, 2022/11/1]

    -- ppt (You may skip pages 125-175 & 301-340.) [2022/10/25, 2022/11/1]

    -- Lectures by Phillip E. C. Compeau

    -- B(2,3)  B(2,4)  [2022/11/1]

    -- B(3,1)&B(3,2) [2022/11/1]

    -- Eulerian Tours by Christos H. Papadimitriou & Umesh Vazirani
(Exchange 10 and 01 of the left figure on Page 4)

    -- An example for fragment assembly as an Eulerian cycle problem  [2022/11/1]

    Pattern Identification in a Haplotype Block (ppt, pdf)  [2022/11/1, 2022/11/8]
 

              -- An example  [2022/11/8]

 

    Haplotype Inference (ppt, pdf)  [2022/11/8]
    (Read the problem formulation on pp. 1262-1263 of the pdf file)

    -- Some other ways of formulation  [2022/11/8]

    The 3SUM Problem  [2022/11/8]

 

    Heaviest Segments  [2022/11/8]

   -- A Note on Computing Heaviest Segments  [2022/11/8]

* No class on November 15 (University Anniversary)

 

    RMSQ  []

   -- Chen, K.-Y. and Chao, K.-M.,  2007, “On the Range Maximum-Sum Segment Query Problem,” Discrete Applied Mathematics, 155: 2043-2052.  []

Homework assignments:

 

#1: Handout: September 6, 2022; Due: September 23, 2022

      Submit a one-page slide, via NTU COOL, including:

 

Class presentations:

1.      The expected number of team members: 1~6;

2.      Each member is required to present in turn [about 10-15 minutes each];

3.      Revised slides should be sent to me within one week after the presentation.

4.      Questions in class are always welcome.

 

Nov. 29:

Sare Amerifar, Mahammad Norouzi, Mahmoud Ghandi:
A tool for feature extraction from biological sequences. Briefings Bioinform. 23(3) (2022)
Sofia Ormazabal, Miguel Benalcazar, Victor Nanche, Simon Moulin

Samantha Petti, Sean R. Eddy:
Constructing benchmark test sets for biological sequence analysis using independent set algorithms. PLoS Comput. Biol. 18(3) (2022)
劉文心 高溥 彭梓瑄 陳功阜

Ingoo Lee, Hojung Nam:
Sequence-based prediction of protein binding regions and drug-target interactions. J. Cheminformatics 14(1): 5 (2022)
張景曜 姚柏仰

Dec. 6:

Quim Aguado-Puig, Santiago Marco-Sola, Juan Carlos Moure, David Castells-Rufas, Lluc Alvarez, Antonio Espinosa, Miquel Moretó:
Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm. IEEE Access 10: 63782-63796 (2022)
Thomas Mannhart, Jennifer He, Tadeo Hepperle, Yi Hong Liu, Janko Strizak

 

Takuya Mieno, Kiichi Watanabe, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda:
Palindromic trees for a sliding window and its applications. Inf. Process. Lett. 173: 106174 (2022)
邱詠智 劉雅嫻 蔡岳璇 嚴鈞琦 黃品淳 余書齊

 

 Stéphane Samson, Etienne Lord, Vladimir Makarenkov:
SimPlot++: a Python application for representing sequence similarity and detecting recombination. Bioinform. 38(11): 3118-3120 (2022)
黃冠博

 

Dec. 13:

Bikash Shrestha, Badri Adhikari:
Scoring protein sequence alignments using deep learning. Bioinform. 38(11): 2988-2995 (2022)
Quinn McGarry, Sitaresmi Wahyu Handani

 

Huan Liu, Quan Zou, Yun Xu:
A novel fast multiple nucleotide sequence alignment method based on FM-index. Briefings Bioinform. 23(1) (2022)
譚雋飛

 

Rubén Langarita, Adrià Armejach, Javier Setoain, Pablo Ibáñez-Marín, Jesús Alastruey-Benedé, Miquel Moretó:
Compressed Sparse FM-Index: Fast Sequence Alignment Using Large K-Steps. IEEE ACM Trans. Comput. Biol. Bioinform. 19(1): 355-368 (2022)
許晉捷 徐梓豪 王怡堯 江明彥 許紅媛

 Nicola De Maio, William R. Boulton, Lukas Weilguny, Conor R. Walker, Yatish Turakhia, Russell Corbett-Detig, Nick Goldman:
phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput. Biol. 18(4) (2022)
吳立夫 葉奎麟 王維謙

 Daniel Gabric, Joe Sawada:
Investigating the discrepancy property of de Bruijn sequences. Discret. Math. 345(4): 112780 (2022)
周柏宇

 

Selected papers for presentation:

(You may access these selected articles using computers with NTU IP addresses.)

1.      Ewen Callaway:
From Neanderthal genome to Nobel prize: meet geneticist Svante Pääbo -- Nature asked the pioneer of ancient-DNA research about some of his greatest discoveries. NEWS Q&A, 07 October 2022, Nature.

Green RE, …, Reich D, Pääbo S. A draft sequence of the Neandertal genome. Science. 2010 May 7;328(5979):710-722.

2.      1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR.:
A global reference for human genetic variation. Nature. 2015 Oct 1;526(7571):68-74.

3.      Dominik Kempa, Tomasz Kociumaka:
Resolution of the burrows-wheeler transform conjecture. Commun. ACM 65(6): 91-98 (2022)

4.      Quim Aguado-Puig, Santiago Marco-Sola, Juan Carlos Moure, David Castells-Rufas, Lluc Alvarez, Antonio Espinosa, Miquel Moretó:
Accelerating Edit-Distance Sequence Alignment on GPU Using the Wavefront Algorithm. IEEE Access 10: 63782-63796 (2022)

5.      Sare Amerifar, Mahammad Norouzi, Mahmoud Ghandi:
A tool for feature extraction from biological sequences. Briefings Bioinform. 23(3) (2022)

6.      Huan Liu, Quan Zou, Yun Xu:
A novel fast multiple nucleotide sequence alignment method based on FM-index. Briefings Bioinform. 23(1) (2022)

7.      Yongqing Zhang, Qiang Zhang, Jiliu Zhou, Quan Zou:
A survey on the algorithm and development of multiple sequence alignment. Briefings Bioinform. 23(3) (2022)

8.      Fatemeh Almodaresi, Jamshed Khan, Sergey Madaminov, Michael Ferdman, Rob Johnson, Prashant Pandey, Rob Patro:
An incrementally updatable and scalable system for large-scale sequence search using the Bentley-Saxe transformation. Bioinform. 38(12): 3155-3163 (2022)

9.      Giovanni Birolo, Andrea Telatin:
BamToCov: an efficient toolkit for sequence coverage calculations. Bioinform. 38(9): 2617-2618 (2022)

10.  Qian Feng, Kathryn E. Tiedje, Shazia Ruybal-Pesántez, Gerry Tonkin-Hill, Michael F. Duffy, Karen P. Day, Heejung Shim, Yao-Ban Chan:
An accurate method for identifying recent recombinants from unaligned sequences. Bioinform. 38(7): 1823-1829 (2022)

11.  Thomas Krannich, W. Timothy J. White, Sebastian Niehus, Guillaume Holley, Bjarni V. Halldórsson, Birte Kehr:
Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinform. 38(3): 604-611 (2022)

12.  Rory Munro, Roberto Santos, Alexander Payne, Teri Forey, Solomon Osei, Nadine Holmes, Matthew Loose:
minoTour, real-time monitoring and analysis for nanopore sequencers. Bioinform. 38(4): 1133-1135 (2022)

13.  Stéphane Samson, Etienne Lord, Vladimir Makarenkov:
SimPlot++: a Python application for representing sequence similarity and detecting recombination. Bioinform. 38(11): 3118-3120 (2022)

14.  Bikash Shrestha, Badri Adhikari:
Scoring protein sequence alignments using deep learning. Bioinform. 38(11): 2988-2995 (2022)

15.  Sofia Persson, Christina Larsson, Magnus Simonsson, Patrik Ellström:
rprimer: an R/bioconductor package for design of degenerate oligos for sequence variable viruses. BMC Bioinform. 23(1): 239 (2022)

16.  Daniel Gabric, Joe Sawada:
Investigating the discrepancy property of de Bruijn sequences. Discret. Math. 345(4): 112780 (2022)

17.  Shubham, Surya Prakash, Pramod Ganapathi:
An Algorithm for the Sequence Alignment with Gap Penalty Problem using Multiway Divide-and-Conquer and Matrix Transposition. Inf. Process. Lett. 173: 106166 (2022)

18.  Chengze Shen, Minhyuk Park, Tandy J. Warnow:
WITCH: Improved Multiple Sequence Alignment Through Weighted Consensus Hidden Markov Model Alignment. J. Comput. Biol. 29(8): 782-801 (2022)

19.  Ingoo Lee, Hojung Nam:
Sequence-based prediction of protein binding regions and drug-target interactions. J. Cheminformatics 14(1): 5 (2022)

20.  Younes Bouhadjar, Dirk J. Wouters, Markus Diesmann, Tom Tetzlaff:
Sequence learning, prediction, and replay in networks of spiking neurons. PLoS Comput. Biol. 18(6) (2022)

21.  Nicola De Maio, William R. Boulton, Lukas Weilguny, Conor R. Walker, Yatish Turakhia, Russell Corbett-Detig, Nick Goldman:
phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets. PLoS Comput. Biol. 18(4) (2022)

22.  Samantha Petti, Sean R. Eddy:
Constructing benchmark test sets for biological sequence analysis using independent set algorithms. PLoS Comput. Biol. 18(3) (2022)

23.  Rubén Langarita, Adrià Armejach, Javier Setoain, Pablo Ibáñez-Marín, Jesús Alastruey-Benedé, Miquel Moretó:
Compressed Sparse FM-Index: Fast Sequence Alignment Using Large K-Steps. IEEE ACM Trans. Comput. Biol. Bioinform. 19(1): 355-368 (2022)

24.  Lenore Pipes, Rasmus Nielsen:
AncestralClust: clustering of divergent nucleotide sequences by ancestral sequence reconstruction using phylogenetic trees. Bioinform. 38(3): 663-670 (2022)

25.  Takuya Mieno, Kiichi Watanabe, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda:
Palindromic trees for a sliding window and its applications. Inf. Process. Lett. 173: 106174 (2022)

References (Recommended, but not required):
1. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield (1997)
2. Biological Sequence Analysis, by Richard Durbin et al. (1998)
3. Computational Molecular Biology: An Algorithmic Approach, by Pavel Pevzner (2000)
4. An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel Pevzner (2004)
5. Sequence Comparison: Theory and Methods,” by Chao, K.-M. and Zhang, L. (2009)
6. Bioinformatics for Biologists, edited by Pavel Pevzner and Ron Shamir (2011)
7. Bioinformatics Algorithms: an Active Learning Approach, Vol. I, II by Phillip E. C. Compeau and Pavel A. Pevzner (August 2015)

[http://rosalind.info/problems/list-view/?location=bioinformatics-textbook-track]

8. Related journal and conference papers
 
Bibliography links: