Statistical Learning Theory


About this course

The goal of this course is to describe a new approach to pattern recognition problems that originated within learning theory. We will concentrate especially on support vector estimation, a very promising classification technique.

Note that this course is highly research-oriented. You should take it only if you are interested in doing related research.


Course Outline


Homework

Once every two weeks. Please write your homework/reports in English.
The score of late homework decreases exponentially with lateness.

Final Projects

This is an important part of this course. Presentations will be on January 5, 2004 (30 minutes per group). Five persons per group. Each group has to turn in a report (<= 10 pages, in English; no Word files, please) by January 3. As these projects are highly research-oriented, please start working on them as early as possible.

On December 1, each group will give a 10- to 15-minute presentation on its progress.

  1. The svm-toy applet on the libsvm homepage is a 2-D example. Can you extend it to 3-D? The main difficulty is testing so many points in a 3-D space.

    There are several possible approaches. One is to consider methods for speeding up SVM testing; there are some papers about this. Alternatively, you may think about ways to approximate the decision surface in a 3-D space.

    The Java code of svm-toy is in the java directory of the libsvm package.
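
    A minimal sketch of the brute-force baseline, assuming libsvm's
    svmutil-style Python interface (svm_train/svm_predict) and hypothetical
    toy data; the n**3 prediction cost it exposes is exactly what the
    project should reduce:

        # approximate the decision surface of a 3-D model by scanning a
        # coarse grid and keeping grid points whose decision value is near 0
        import random
        from svmutil import svm_train, svm_predict

        def blob(cx, cy, cz, n, label):
            # hypothetical toy data: a Gaussian-ish blob around (cx, cy, cz)
            return ([label] * n,
                    [{1: cx + random.gauss(0, 0.3),
                      2: cy + random.gauss(0, 0.3),
                      3: cz + random.gauss(0, 0.3)} for _ in range(n)])

        y1, x1 = blob(0.3, 0.3, 0.3, 50, +1)
        y2, x2 = blob(0.7, 0.7, 0.7, 50, -1)
        model = svm_train(y1 + y2, x1 + x2, '-t 2 -c 100 -q')

        # brute force: n**3 prediction points (placeholder labels of 0)
        n, eps = 30, 0.08
        grid = [{1: i / n, 2: j / n, 3: k / n}
                for i in range(n) for j in range(n) for k in range(n)]
        _, _, vals = svm_predict([0] * len(grid), grid, model)
        surface = [p for p, v in zip(grid, vals) if abs(v[0]) < eps]
        print(len(surface), 'grid points lie near the decision surface')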

  2. SVM does not perform well on the EUNITE 2 data set. Can you think of any way to use SVM and achieve 75% accuracy? The data set is in ~cjlin/software/svm/eunite2. The subdirectory RAW contains the original data, but norm39.tr and norm39.test are the main training and testing files to consider (i.e., the data of HW 1).

    An earlier study of this data set is in the second part of the thesis at ~cjlin/latex/students/Bo-June.Chen.

  3. SVM has not been fully extended to online learning. An essential part of doing this is the ability to efficiently add/remove samples in the SVM training program. We already have some mechanism for this, so in this project we hope to consider some simple applications and study the applicability of SVM to online learning. ~cjlin/software/svm/libsvm/libsvm-2.4-loor.tgz is an example of modifying libsvm to calculate the leave-one-out error (this one is for regression; a version for classification is at ~b6506054/libsvm/loo). Using the same techniques, the code can be modified for an online setting.
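
    As a point of reference, a naive online baseline retrains from scratch
    on a sliding window at every step; an efficient add/remove mechanism
    should match its predictions while avoiding the full retraining. A
    sketch, assuming the svmutil-style Python interface and a stream of
    (label, sparse dict) pairs:

        from svmutil import svm_train, svm_predict

        def online_errors(stream, window=200, params='-t 2 -c 1 -q'):
            ys, xs, errors, model = [], [], 0, None
            for y, x in stream:
                if model is not None:
                    pred, _, _ = svm_predict([y], [x], model)
                    errors += (pred[0] != y)         # predict, then update
                ys.append(y); xs.append(x)
                ys, xs = ys[-window:], xs[-window:]  # drop oldest samples
                if len(set(ys)) > 1:                 # need both classes
                    # full retrain at every step: the bottleneck to remove
                    model = svm_train(ys, xs, params)
            return errors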

  4. SVM does not tell users which features are more important. Can you think of an effective approach for doing so?

    For example, we consider the first problem of KDD Cup 2001: the prediction of molecular bioactivity for drug design -- binding to thrombin.

    The training file contains 1,910 instances with more than 100,000 features; in the sparse format it is available here (in zip format). For testing we have 634 instances, which are available here (in zip format).

    Before using SVM or other classification methods, we should select a few hundred or a few thousand "important features." You should think of different ways to perform such a selection and conduct experiments to compare with existing results.

    You might want to check the thesis in ~cjlin/software/svm/libsvm/papers/MatthiasHeiler.ps.gz. An implementation is in ~cjlin/software/svm/libsvm/fselect.tgz.
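
    One simple filter criterion, in the spirit of the fselect tool (whose
    exact implementation may differ), is the F-score of each feature: the
    ratio of between-class to within-class variation. A sketch in plain
    Python over sparse-format data:

        from collections import defaultdict

        def f_scores(y, x):
            # y: list of +1/-1 labels; x: list of {index: value} dicts,
            # where absent indices are zeros (the sparse format)
            sums = {+1: defaultdict(float), -1: defaultdict(float)}
            sqs = {+1: defaultdict(float), -1: defaultdict(float)}
            n = {+1: 0, -1: 0}
            feats = set()
            for yi, xi in zip(y, x):
                n[yi] += 1
                for j, v in xi.items():
                    feats.add(j)
                    sums[yi][j] += v
                    sqs[yi][j] += v * v
            scores = {}
            for j in feats:
                m_pos, m_neg = sums[+1][j] / n[+1], sums[-1][j] / n[-1]
                m_all = (sums[+1][j] + sums[-1][j]) / (n[+1] + n[-1])
                # within-class variances; absent sparse entries count as 0
                var_pos = sqs[+1][j] / n[+1] - m_pos ** 2
                var_neg = sqs[-1][j] / n[-1] - m_neg ** 2
                denom = var_pos + var_neg
                scores[j] = (((m_pos - m_all) ** 2 + (m_neg - m_all) ** 2)
                             / denom if denom > 0 else 0.0)
            return scores  # keep the few hundred highest-scoring features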

  5. SVM for multi-labelling problems.

    Multi-labelling is a different type of classification problem: a data instance may belong to more than one class. This project concerns how to extend libsvm to such problems. First you need to collect some multi-labelling data sets. This can be done by checking some earlier work such as paper1, paper2, and the references therein. The only multi-label problem in the UCI machine learning repository is "university," which could also be considered.

    Libsvm uses the so-called 1vs1 (pairwise) voting strategy for multi-class classification. In the prediction phase, k(k-1)/2 votes are cast among k classes and we pick the one with the largest number of votes. The goal of this project is to construct a multi-labelling system based on the current 1vs1 strategy implemented in libsvm.

    For multi-labelling problems, we may have to select the classes with the second or third highest votes as well. You need to design a new rule for this selection process and conduct experiments to compare with existing approaches. For example, for each (C,g), you take four folds as training and one fold as validation. Using the four folds you first train k(k-1)/2 models. With all f_ij and x as attributes and the number of classes as the target label, you conduct SVR or multi-class SVM to obtain a model for predicting the number of classes. Then for each instance of the validation set, you first calculate the k(k-1)/2 decision values and then predict the number of classes.

    When conducting the pairwise training, if x is in both class1 and class2, in the training file you can directly have

             class1 x
             class2 x
    
    Is this right or not? For the binary problem of class1 vs. class2, should we exclude data which are in both classes? By excluding such data, we hope that when they are put into the model, the probability of being in either class is 0.5. If they are included, most likely the separating plane cuts through them, so the same effect seems to occur. From this point of view, it may be fine to include data that are in both classes.

    Note that once the number of classes (labels) is determined, there are different ways to do the prediction. One is to simply select the classes with the highest votes. Another is to sequentially remove the votes involving the class picked earlier and then select the highest among the remaining. For example, if we would like to choose two classes and class 1 is the one with the highest votes, we remove all results of 1vs2, ..., 1vsk before selecting the other class. (A small sketch of this rule appears at the end of this item.)

    You should also consider some extreme situations. For example, if x_i, i = 1, ..., l are all in classes 1, 2, and 3, will your system be able to correctly label the data? How about existing approaches?

    You may want to use the new version of libsvm (2.5), in which decision values can be easily accessed. In earlier versions, decision values are not available to users. We have both C++ and Python subroutines to access them. You may also want to check somi's svmprob example in ~cjlin/htdocs/svmprob/svmprob-1.0.tgz.

    The evaluation of such data sets is also a concern; different criteria may have to be tried. You may then compare your approach with the so-called "binary approach" mentioned in the paper by Elisseeff and Weston. This "binary approach" is based on "1 vs. the rest." You can make a modification using 1vs1 as well.
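
    A sketch of the sequential-removal rule mentioned above; decide(i, j, xx)
    stands for the sign of the pairwise decision value f_ij(xx) and is a
    placeholder for however you obtain decision values (e.g., from libsvm
    2.5):

        def pick_labels(k, m, decide, xx):
            # return m predicted labels for xx from k(k-1)/2 pairwise tests
            results = {}                     # (i, j) -> winning class
            for i in range(k):
                for j in range(i + 1, k):
                    results[(i, j)] = i if decide(i, j, xx) > 0 else j
            chosen = []
            for _ in range(m):
                votes = [0] * k
                for w in results.values():
                    votes[w] += 1
                best = max((c for c in range(k) if c not in chosen),
                           key=lambda c: votes[c])
                chosen.append(best)
                # drop every pairwise result involving the chosen class
                results = {p: w for p, w in results.items() if best not in p}
            return chosen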

  6. In paper 12 of HW 7, the authors experimented with SVM on large three-class problems. The performance is not very satisfactory. As the authors did not conduct parameter selection when using SVM, we would like to redo the experiments with some model selection.

    The authors have provided all data on the web. You have to transform it into our format. Reproducing the same results as in the paper is the first step; then we seek any possible improvements.

    It seems that for this application, the number of support vectors is a concern. We may apply some novel techniques to improve this aspect.

  7. This project is for people interested in parallel processing. Currently libsvm has a great Python tool, grid.py, for parameter selection. It can easily dispatch jobs to different computers. However, for the R interface, the current tune() function can only be run serially. For this project, we would like to construct a parallel version. Some modifications of the snow package in R may be needed. We hope that, similar to grid.py, if some nodes fail, their jobs will be returned to the queue and then dispatched to others.
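
    For reference, the fault tolerance that grid.py provides amounts to a
    work queue whose failed jobs are pushed back for the remaining workers
    to pick up. A minimal sketch of that pattern (in Python here, though
    the project targets R's tune() and snow):

        import queue, threading

        def run_grid(jobs, nodes, train_one):
            # jobs: (C, gamma) pairs; train_one(node, job) -> result, or raises
            q, results = queue.Queue(), {}
            for job in jobs:
                q.put(job)
            def worker(node):
                while True:
                    try:
                        job = q.get_nowait()
                    except queue.Empty:
                        return                 # no work left
                    try:
                        results[job] = train_one(node, job)
                    except Exception:
                        q.put(job)             # node failed: requeue the job
                        return                 # retire this node
            threads = [threading.Thread(target=worker, args=(n,))
                       for n in nodes]
            for t in threads: t.start()
            for t in threads: t.join()
            return results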

  8. libsvm has quite a few interfaces to scripting languages (e.g., R, Matlab, Python, Ruby, and Perl). When running large data sets, some overheads may cause the computational time to be larger than that of directly using the C++ binaries. We would like to investigate the behavior of the different interfaces and identify where the bottlenecks are. This project is for people who are interested in scripting languages.
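
    One way to start is to time the same training run through a scripting
    interface and through the binaries, so that loading/conversion overhead
    can be separated from the training itself. A sketch, assuming the
    svmutil-style Python interface, the heart_scale example file, and an
    svm-train binary in the current directory:

        import subprocess, time
        from svmutil import svm_read_problem, svm_train

        t0 = time.time()
        y, x = svm_read_problem('heart_scale')  # parsing/conversion overhead
        t1 = time.time()
        svm_train(y, x, '-t 2 -c 1 -q')         # training via the interface
        t2 = time.time()
        subprocess.call(['./svm-train', '-t', '2', '-c', '1',
                         'heart_scale', '/dev/null'])  # C++ binary, end to end
        t3 = time.time()
        print('read %.2fs, train via interface %.2fs, binary total %.2fs'
              % (t1 - t0, t2 - t1, t3 - t2))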

  9. This is for people interested in computer architecture. For HW3, we have seen that running on a PIV-2.4 (linux2) is three times faster than on a PIV-2.0 (linux11) and seven times faster than on a PIII-1.0 (linux5). We know the bottleneck is at some level of data access. Your job is to identify and analyze this issue.

    Finding some tools for analyzing memory usage or data locality may help.

  10. The ICDAR competition includes several pattern recognition problems such as page segmentation, word recognition, character recognition, and text locating. Most of them can be formulated as classification problems.

    For this project, you should focus on one or two topics and propose your own methods to solve them.

  11. Transductive learning is a method that tries to use the information of the testing data. For regular classification problems, we assume that only training data is available. However, in many cases, test data with unknown labels are available, so in training we would like to consider them as well. The idea is to use their attributes, as their labels are not known.

    Several papers have shown ways to do transductive SVMs; for example, the software svmlight has an implementation. In this project, you should re-implement existing methods (by modifying libsvm), conduct experiments, and, if possible, propose new approaches. Note that efficient implementation is not the most important thing here, as we focus on getting some observations; hence you may want to use the Python interface to save implementation time.

    You may want to think about multi-class transductive learning as well.
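
    For a first feeling of the idea, here is a minimal self-training-style
    sketch (train, label the unlabeled points, retrain with the most
    confident of them added). This is not Joachims' TSVM algorithm, and the
    project should implement the published methods; the svmutil-style
    Python interface is assumed:

        from svmutil import svm_train, svm_predict

        def self_train(y_lab, x_lab, x_unl, rounds=5, keep=0.2,
                       params='-t 0 -c 1 -q'):
            model = svm_train(y_lab, x_lab, params)
            for _ in range(rounds):
                # placeholder labels of 0; only the decision values matter
                labels, _, vals = svm_predict([0] * len(x_unl), x_unl, model)
                ranked = sorted(zip(x_unl, labels, vals),
                                key=lambda t: abs(t[2][0]), reverse=True)
                confident = ranked[:int(keep * len(ranked))]
                model = svm_train(y_lab + [l for _, l, _ in confident],
                                  x_lab + [x for x, _, _ in confident],
                                  params)
            return model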

  12. EUNITE 3 competition (revised version)

  13. Range of SVM parameter search

  14. Time series prediction tool

Exam

No exam.
Last modified: Tue Jan 13 14:43:11 CST 2004