Experimental Results on the RS126 Data Set

Protein secondary structure prediction is a well-known problem in bioinformatics. Because techniques for predicting an unknown 3-D protein structure directly from the primary amino acid sequence are still immature, researchers first try to predict the secondary structure elements from the amino acid sequence. Secondary structure prediction, however, is itself a difficult problem. Before 1993, prediction accuracy was only slightly better than random guessing. In 1993, Rost and Sander proposed the PHD system [Rost and Sander, 1993], which improved accuracy significantly, from 64.3% to 70.8%, by using the evolutionary information contained in multiple sequence alignments.

Protein secondary structure prediction has been tackled with numerous learning algorithms, including neural networks, SVMs, and other well-known classifiers [Riis and Krogh, 1996; Cuff and Barton, 1999; Hua and Sun, 2001; Ward et al., 2003], and it therefore serves as a classic benchmark for testing the effectiveness of new techniques.

We conducted the experiments on RS126, the best-known data set in protein secondary structure prediction. The RS126 data set has been studied in many publications [Riis and Krogh, 1996; Cuff and Barton, 1999; Hua and Sun, 2001; Ward et al., 2003] and can be downloaded at this website. We also adopted the same 7-fold partition used by Riis and Krogh [Riis and Krogh, 1996].
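The 7-fold protocol can be sketched as follows: each fold is held out once for testing while the remaining folds are used for training, and one accuracy value is reported per fold. This is a minimal illustration, not the code used in the experiments. The names `cross_validate`, `accuracy`, and `train_majority` are hypothetical, the toy classifier is a stand-in for LIBSVM or QuickRBF, and the labels H, E, and C are used only as example class names.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], str]          # (feature vector, class label)
Predictor = Callable[[Sequence[float]], str]   # maps a feature vector to a label


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Percentage of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label sequences must be non-empty and equally long")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def cross_validate(
    folds: Sequence[Sequence[Example]],
    train: Callable[[List[Example]], Predictor],
) -> List[float]:
    """Hold out each fold once; train on the rest; return one accuracy per fold."""
    scores = []
    for i, held_out in enumerate(folds):
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train(training)
        predicted = [predict(features) for features, _ in held_out]
        actual = [label for _, label in held_out]
        scores.append(accuracy(predicted, actual))
    return scores


def train_majority(training: List[Example]) -> Predictor:
    """Toy stand-in classifier: always predicts the most common training label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority


# Three tiny folds for illustration; the experiments use the seven RS126 folds.
toy_folds = [
    [([0.1], "H"), ([0.2], "E")],
    [([0.3], "H"), ([0.4], "H")],
    [([0.5], "H"), ([0.6], "C")],
]
print(cross_validate(toy_folds, train_majority))
```

With a fixed partition such as the one from Riis and Krogh, the folds are supplied as given rather than drawn at random, which is what keeps per-fold results comparable across publications.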

For the classifier parameter settings, we used the grid.py utility in the LIBSVM package to perform model selection for the SVM; the utility selects the best parameter set from 90 parameter combinations.
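The model selection step can be pictured as an exhaustive search over a grid of (C, gamma) pairs, keeping the pair with the best cross-validation accuracy, which is the general approach grid.py takes for the RBF kernel. The sketch below is an assumption-laden illustration: the exponent ranges are hypothetical and chosen only so that the grid has 10 x 9 = 90 points (the actual ranges are not given here), and `toy_cv_accuracy` is a made-up scoring function standing in for a real cross-validation run.

```python
import math
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(
    log2_c_values: Iterable[int],
    log2_gamma_values: Iterable[int],
    cv_accuracy: Callable[[float, float], float],
) -> Tuple[Tuple[float, float], float, int]:
    """Evaluate every (C, gamma) pair and return the best pair, its score,
    and the number of combinations tried."""
    best_params, best_score, n_tried = None, float("-inf"), 0
    for log2_c, log2_gamma in product(log2_c_values, log2_gamma_values):
        c, gamma = 2.0 ** log2_c, 2.0 ** log2_gamma
        score = cv_accuracy(c, gamma)
        n_tried += 1
        if score > best_score:
            best_params, best_score = (c, gamma), score
    return best_params, best_score, n_tried


# Hypothetical exponent ranges (assumption): 10 values of log2(C) and
# 9 values of log2(gamma), giving 90 combinations in total.
LOG2_C = range(-5, 15, 2)        # -5, -3, ..., 13
LOG2_GAMMA = range(3, -15, -2)   # 3, 1, ..., -13


def toy_cv_accuracy(c: float, gamma: float) -> float:
    """Made-up score that peaks at C = 2^3, gamma = 2^-5 (not real data)."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_params, best_score, n_tried = grid_search(LOG2_C, LOG2_GAMMA, toy_cv_accuracy)
print(best_params, n_tried)
```

In a real run, `cv_accuracy` would train and evaluate an SVM for each pair, so the cost of model selection grows linearly with the number of grid points.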

All experiments were run in the same environment and on the same data sets, so the comparison should be fair. Detailed accuracy results are given in Table 1 and Table 2. As the two tables show, the proposed method delivers essentially the same level of accuracy as LIBSVM.

Table 1: Comparison of classification accuracy (%) on the RS126 data set with PSI-BLAST PSSM profiles

RS126      LIBSVM   QuickRBF   QuickRBF   QuickRBF   QuickRBF
                    All        12000      5000       1000
Set A      74.06    74.14      74.01      73.73      72.71
Set B      77.44    77.01      76.32      75.54      74.76
Set C      74.99    75.01      75.07      74.93      73.85
Set D      73.11    73.69      73.72      72.44      71.44
Set E      74.08    74.19      74.26      73.97      73.14
Set F      76.93    77.23      77.39      77.28      76.12
Set G      73.82    74.27      74.30      74.07      74.36
Average    74.92    75.08      75.01      74.57      73.77
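The Average row of Table 1 appears to be the unweighted mean of the seven per-fold accuracies; this is an inference from the numbers rather than something stated in the text. The short check below recomputes it from the per-fold values. The column labels used as dictionary keys are shorthand introduced here for illustration.

```python
# Per-fold accuracies (%) copied from Table 1, Sets A through G.
TABLE_1 = {
    "LIBSVM":         [74.06, 77.44, 74.99, 73.11, 74.08, 76.93, 73.82],
    "QuickRBF All":   [74.14, 77.01, 75.01, 73.69, 74.19, 77.23, 74.27],
    "QuickRBF 12000": [74.01, 76.32, 75.07, 73.72, 74.26, 77.39, 74.30],
    "QuickRBF 5000":  [73.73, 75.54, 74.93, 72.44, 73.97, 77.28, 74.07],
    "QuickRBF 1000":  [72.71, 74.76, 73.85, 71.44, 73.14, 76.12, 74.36],
}


def fold_mean(values):
    """Unweighted mean over folds, rounded to two decimals as in the table."""
    return round(sum(values) / len(values), 2)


averages = {name: fold_mean(values) for name, values in TABLE_1.items()}
for name, avg in averages.items():
    print(f"{name:15s} {avg:.2f}")
```

An average weighted by the number of residues in each fold would generally give slightly different values, so the agreement with the table suggests that each fold counts equally here.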

Table 2: Comparison of classification accuracy (%) on the RS126 data set without PSI-BLAST PSSM profiles

RS126      LIBSVM   QuickRBF   QuickRBF   QuickRBF   QuickRBF
                    sv         15000      12000      8000
Set A      68.14    68.45      68.73      68.61      67.73
Set B      73.33    73.86      73.90      73.29      72.65
Set C      71.72    71.40      71.57      71.09      69.98
Set D      70.45    70.89      70.33      70.65      70.01
Set E      70.26    70.19      70.26      70.76      70.04
Set F      72.39    72.94      72.34      72.03      71.78
Set G      71.67    71.95      71.39      70.94      70.88
Average    71.14    71.38      71.22      71.05      70.44
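The statement that the proposed method matches LIBSVM can be checked directly from the Average rows of Table 1 and Table 2. The sketch below computes the difference, in percentage points, between each QuickRBF column and the LIBSVM column. The dictionary layout and the function name `gaps_to_libsvm` are illustrative choices, not part of either package.

```python
# Average accuracies (%) copied from the last rows of Table 1 and Table 2.
AVERAGES = {
    "with PSSM profiles": {
        "LIBSVM": 74.92,
        "QuickRBF": {"All": 75.08, "12000": 75.01, "5000": 74.57, "1000": 73.77},
    },
    "without PSSM profiles": {
        "LIBSVM": 71.14,
        "QuickRBF": {"sv": 71.38, "15000": 71.22, "12000": 71.05, "8000": 70.44},
    },
}


def gaps_to_libsvm(row):
    """QuickRBF average minus LIBSVM average for each column, in points."""
    return {
        setting: round(acc - row["LIBSVM"], 2)
        for setting, acc in row["QuickRBF"].items()
    }


gaps = {name: gaps_to_libsvm(row) for name, row in AVERAGES.items()}
for name, row_gaps in gaps.items():
    print(name, row_gaps)
```

In both tables the first two QuickRBF columns land within about a quarter of a percentage point of LIBSVM, while the rightmost column trails it by 1.15 points with PSSM profiles and 0.70 points without.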