Programming assignment #1 Kun-Mao Chao Feb. 24, 2003.

Due: March 10, 2003.

The C + G-richest region of a DNA sequence

The main aim of the Human Genome Project is to determine the roughly three billion nucleotides of the human genome, which encodes all the information of a human, by the year 2003. This DNA sequence data, essentially a gigantic string of the four letters A, C, G, and T, can help us understand and eventually treat many of the more than 4000 genetic diseases that afflict mankind. However, the data will be useless unless proper methods are developed to interpret the information that it encodes.

Computer analysis of newly determined DNA sequences usually includes a task to identify C + G-rich regions. One reasonable scheme to measure the C + G richness of a region is the scoring expression x – p × l, where x is the C + G count of the region, l is the length of the region, and p is a positive ratio constant. Given a ratio 0 < p <1, a length cutoff L, and a DNA sequence, i.e. a string of the four letters A, C, G, and T, write a program to report the score of the region of length at least L, which maximizes the expression x – p × l. (We will explain a linear-time approach in class.)

For example, if p = 0.6, L = 5, and the given DNA sequence is GACGTCCCAGCAACAAA, the C + G-richest region of length at least L under this measurement is the region from position 3 to position 11, i.e., CGTCCCAGC. This region contains 5 C’s and 2 G’s, thus the C + G count is 7. Its score is 7 – 0.6 × 9 = 1.6. This is the highest score of any region of length at least 5.

Input Format

p (a real number between 0 and 1)

L (an integer between 1 and 10000)

The DNA sequence is of length at most 50000.

The input file is terminated by a line with the number 0.

Output Format

Report the score of the region of length at least L, which maximizes the expression x – p × l.

Sample Input

0.6

GACGTC

CCAGCAACAAA

Sample Output

The highest score is 1.6.