Statistical Learning Theory


About this course

The goal of this course is to describe a new approach to pattern recognition problems that originated within learning theory. We will concentrate especially on support vector estimation, a very promising classification technique.

Note that this course is highly research-oriented. You should take it only if you are interested in doing related research.


Course Outline


Homework

Once every two weeks. Please write your homework/reports in English.
The score of late homework decreases exponentially with lateness.

Final Projects

This is an important part of this course. Presentations will be on January 5, 2004 (30 minutes per group). Five persons per group. Each group has to turn in a report (<= 10 pages, in English; no Word files, please) by January 3. As these projects are highly research-oriented, please start working on them as early as possible.

On December 1, each group will give a 10- to 15-minute presentation on its progress.

  1. The svm-toy applet on the libsvm homepage is a 2-D example. Can you extend it to 3-D? The main difficulty is testing so many points in a 3-D space.

    There are several possible approaches. One is to consider methods for speeding up SVM testing; there are some papers about this. Alternatively, you may think about ways to approximate the decision surface in a 3-D space.

    The Java code of svm-toy is in the java directory of the libsvm package.
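
    A minimal sketch of the brute-force baseline, assuming libsvm's
    svmutil-style Python interface (svm_train/svm_predict) and hypothetical
    toy data; the n**3 prediction cost it exposes is exactly what the
    project should reduce:

        # approximate the decision surface of a 3-D model by scanning a
        # coarse grid and keeping grid points whose decision value is near 0
        import random
        from svmutil import svm_train, svm_predict

        def blob(cx, cy, cz, n, label):
            # hypothetical toy data: a Gaussian-ish blob around (cx, cy, cz)
            return ([label] * n,
                    [{1: cx + random.gauss(0, 0.3),
                      2: cy + random.gauss(0, 0.3),
                      3: cz + random.gauss(0, 0.3)} for _ in range(n)])

        y1, x1 = blob(0.3, 0.3, 0.3, 50, +1)
        y2, x2 = blob(0.7, 0.7, 0.7, 50, -1)
        model = svm_train(y1 + y2, x1 + x2, '-t 2 -c 100 -q')

        # brute force: n**3 prediction points (placeholder labels of 0)
        n, eps = 30, 0.08
        grid = [{1: i / n, 2: j / n, 3: k / n}
                for i in range(n) for j in range(n) for k in range(n)]
        _, _, vals = svm_predict([0] * len(grid), grid, model)
        surface = [p for p, v in zip(grid, vals) if abs(v[0]) < eps]
        print(len(surface), 'grid points lie near the decision surface')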

  2. SVM does not perform well on the EUNITE 2 data set. Can you think of any way to use SVM and achieve 75% accuracy? The data set is in ~cjlin/software/svm/eunite2. The subdirectory RAW contains the original data, but norm39.tr and norm39.test are the main training and testing files to consider (i.e., the data of HW 1).

    An earlier study of this data set is in the second part of the thesis at ~cjlin/latex/students/Bo-June.Chen.

  3. SVM has not been fully extended to online learning. An essential part of doing this is the ability to efficiently add/remove samples in the SVM training program. We already have some mechanism for this, so in this project we hope to consider some simple applications and study the applicability of SVM to online learning. ~cjlin/software/svm/libsvm/libsvm-2.4-loor.tgz is an example of modifying libsvm to calculate the leave-one-out error (this one is for regression; a version for classification is at ~b6506054/libsvm/loo). Using the same techniques, the code can be modified for an online setting.
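
    As a point of reference, a naive online baseline retrains from scratch
    on a sliding window at every step; an efficient add/remove mechanism
    should match its predictions while avoiding the full retraining. A
    sketch, assuming the svmutil-style Python interface and a stream of
    (label, sparse dict) pairs:

        from svmutil import svm_train, svm_predict

        def online_errors(stream, window=200, params='-t 2 -c 1 -q'):
            ys, xs, errors, model = [], [], 0, None
            for y, x in stream:
                if model is not None:
                    pred, _, _ = svm_predict([y], [x], model)
                    errors += (pred[0] != y)         # predict, then update
                ys.append(y); xs.append(x)
                ys, xs = ys[-window:], xs[-window:]  # drop oldest samples
                if len(set(ys)) > 1:                 # need both classes
                    # full retrain at every step: the bottleneck to remove
                    model = svm_train(ys, xs, params)
            return errors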

  4. SVM does not tell users which features are more important. Can you think of an effective approach for doing so?

    For example, we consider the first problem of KDD Cup 2001: the prediction of molecular bioactivity for drug design -- binding to thrombin.

    The training file contains 1,910 instances with more than 100,000 features; in the sparse format it is available here (in zip format). For testing we have 634 instances, which are available here (in zip format).

    Before using SVM or other classification methods, we should select a few hundred or a few thousand "important features." You should think of different ways to perform such a selection and conduct experiments to compare with existing results.

    You might want to check the thesis in ~cjlin/software/svm/libsvm/papers/MatthiasHeiler.ps.gz. An implementation is in ~cjlin/software/svm/libsvm/fselect.tgz.
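
    One simple filter criterion, in the spirit of the fselect tool (whose
    exact implementation may differ), is the F-score of each feature: the
    ratio of between-class to within-class variation. A sketch in plain
    Python over sparse-format data:

        from collections import defaultdict

        def f_scores(y, x):
            # y: list of +1/-1 labels; x: list of {index: value} dicts,
            # where absent indices are zeros (the sparse format)
            sums = {+1: defaultdict(float), -1: defaultdict(float)}
            sqs = {+1: defaultdict(float), -1: defaultdict(float)}
            n = {+1: 0, -1: 0}
            feats = set()
            for yi, xi in zip(y, x):
                n[yi] += 1
                for j, v in xi.items():
                    feats.add(j)
                    sums[yi][j] += v
                    sqs[yi][j] += v * v
            scores = {}
            for j in feats:
                m_pos, m_neg = sums[+1][j] / n[+1], sums[-1][j] / n[-1]
                m_all = (sums[+1][j] + sums[-1][j]) / (n[+1] + n[-1])
                # within-class variances; absent sparse entries count as 0
                var_pos = sqs[+1][j] / n[+1] - m_pos ** 2
                var_neg = sqs[-1][j] / n[-1] - m_neg ** 2
                denom = var_pos + var_neg
                scores[j] = (((m_pos - m_all) ** 2 + (m_neg - m_all) ** 2)
                             / denom if denom > 0 else 0.0)
            return scores  # keep the few hundred highest-scoring features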

  5. SVM for multi-labelling problems.

    Multi-labelling is a different type of classification problem: a data instance may belong to more than one class. This project concerns how to extend libsvm to such problems. First you need to collect some multi-labelling data sets. This can be done by checking some earlier work such as paper1, paper2, and the references therein. The only multi-label problem in the UCI machine learning repository is "university," which could also be considered.

    Libsvm uses the so-called 1vs1 (pairwise) voting strategy for multi-class classification. In the prediction phase, k(k-1)/2 votes are cast among k classes and we pick the one with the largest number of votes. The goal of this project is to construct a multi-labelling system based on the current 1vs1 strategy implemented in libsvm.

    For multi-labelling problems, we may have to select the classes with the second or third highest votes as well. You need to design a new rule for this selection process and conduct experiments to compare with existing approaches. For example, for each (C,g), you take four folds as training and one fold as validation. Using the four folds you first train k(k-1)/2 models. With all f_ij and x as attributes and the number of classes as the target label, you conduct SVR or multi-class SVM to obtain a model for predicting the number of classes. Then for each instance of the validation set, you first calculate the k(k-1)/2 decision values and then predict the number of classes.

    When conducting the pairwise training, if x is in both class1 and class2, in the training file you can directly have

             class1 x
             class2 x
    
    Is this right or not? For the binary problem of class1 vs. class2, should we exclude data which are in both classes? By excluding such data, we hope that when they are put into the model, the probability of being in either class is 0.5. If they are included, most likely the separating plane cuts through them, so the same effect seems to occur. From this point of view, it may be fine to include data that are in both classes.

    Note that once the number of classes (labels) is determined, there are different ways to do the prediction. One is to simply select the classes with the highest votes. Another is to sequentially remove the votes involving the class picked earlier and then select the highest among the remaining. For example, if we would like to choose two classes and class 1 is the one with the highest votes, we remove all results of 1vs2, ..., 1vsk before selecting the other class. (A small sketch of this rule appears at the end of this item.)

    You should also consider some extreme situations. For example, if x_i, i = 1, ..., l are all in classes 1, 2, and 3, will your system be able to correctly label the data? How about existing approaches?

    You may want to use the new version of libsvm (2.5), in which decision values can be easily accessed. In earlier versions, decision values are not available to users. We have both C++ and Python subroutines to access them. You may also want to check somi's svmprob example in ~cjlin/htdocs/svmprob/svmprob-1.0.tgz.

    The evaluation of such data sets is also a concern; different criteria may have to be tried. You may then compare your approach with the so-called "binary approach" mentioned in the paper by Elisseeff and Weston. This "binary approach" is based on "1 vs. the rest." You can make a modification using 1vs1 as well.
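
    A sketch of the sequential-removal rule mentioned above; decide(i, j, xx)
    stands for the sign of the pairwise decision value f_ij(xx) and is a
    placeholder for however you obtain decision values (e.g., from libsvm
    2.5):

        def pick_labels(k, m, decide, xx):
            # return m predicted labels for xx from k(k-1)/2 pairwise tests
            results = {}                     # (i, j) -> winning class
            for i in range(k):
                for j in range(i + 1, k):
                    results[(i, j)] = i if decide(i, j, xx) > 0 else j
            chosen = []
            for _ in range(m):
                votes = [0] * k
                for w in results.values():
                    votes[w] += 1
                best = max((c for c in range(k) if c not in chosen),
                           key=lambda c: votes[c])
                chosen.append(best)
                # drop every pairwise result involving the chosen class
                results = {p: w for p, w in results.items() if best not in p}
            return chosen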

  6. In paper 12 of HW 7, the authors experimented with SVM on large three-class problems. The performance is not very satisfactory. As the authors did not conduct parameter selection when using SVM, we would like to redo the experiments with some model selection.

    The authors have provided all data on the web. You have to transform it into our format. Reproducing the same results as in the paper is the first step; then we seek any possible improvements.

    It seems that for this application, the number of support vectors is a concern. We may apply some novel techniques to improve this aspect.

  7. This project is for people interested in parallel processing. Currently libsvm has a great Python tool, grid.py, for parameter selection. It can easily dispatch jobs to different computers. However, for the R interface, the current tune() function can only be run serially. For this project, we would like to construct a parallel version. Some modifications of the snow package in R may be needed. We hope that, similar to grid.py, if some nodes fail, their jobs will be returned to the queue and then dispatched to others.
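
    For reference, the fault tolerance that grid.py provides amounts to a
    work queue whose failed jobs are pushed back for the remaining workers
    to pick up. A minimal sketch of that pattern (in Python here, though
    the project targets R's tune() and snow):

        import queue, threading

        def run_grid(jobs, nodes, train_one):
            # jobs: (C, gamma) pairs; train_one(node, job) -> result, or raises
            q, results = queue.Queue(), {}
            for job in jobs:
                q.put(job)
            def worker(node):
                while True:
                    try:
                        job = q.get_nowait()
                    except queue.Empty:
                        return                 # no work left
                    try:
                        results[job] = train_one(node, job)
                    except Exception:
                        q.put(job)             # node failed: requeue the job
                        return                 # retire this node
            threads = [threading.Thread(target=worker, args=(n,))
                       for n in nodes]
            for t in threads: t.start()
            for t in threads: t.join()
            return results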

  8. libsvm has quite a few interfaces to scripting languages (e.g., R, Matlab, Python, Ruby, and Perl). When running large data sets, some overheads may cause the computational time to be larger than that of directly using the C++ binaries. We would like to investigate the behavior of the different interfaces and identify where the bottlenecks are. This project is for people who are interested in scripting languages.
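
    One way to start is to time the same training run through a scripting
    interface and through the binaries, so that loading/conversion overhead
    can be separated from the training itself. A sketch, assuming the
    svmutil-style Python interface, the heart_scale example file, and an
    svm-train binary in the current directory:

        import subprocess, time
        from svmutil import svm_read_problem, svm_train

        t0 = time.time()
        y, x = svm_read_problem('heart_scale')  # parsing/conversion overhead
        t1 = time.time()
        svm_train(y, x, '-t 2 -c 1 -q')         # training via the interface
        t2 = time.time()
        subprocess.call(['./svm-train', '-t', '2', '-c', '1',
                         'heart_scale', '/dev/null'])  # C++ binary, end to end
        t3 = time.time()
        print('read %.2fs, train via interface %.2fs, binary total %.2fs'
              % (t1 - t0, t2 - t1, t3 - t2))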

  9. This is for people interested in computer architecture. For HW3, we have seen that running on a PIV-2.4 (linux2) is three times faster than on a PIV-2.0 (linux11) and seven times faster than on a PIII-1.0 (linux5). We know the bottleneck is at some level of data access. Your job is to identify and analyze this issue.

    Finding some tools for analyzing memory usage or data locality may help.

  10. The ICDAR competition includes several pattern recognition problems such as page segmentation, word recognition, character recognition, and text locating. Most of them can be formulated as classification problems.

    For this project, you should focus on one or two topics and propose your own methods to solve them.

  11. Transductive learning is a method that tries to use the information of the testing data. For regular classification problems, we assume that only training data is available. However, in many cases, test data with unknown labels are available, so in training we would like to consider them as well. The idea is to use their attributes, as their labels are not known.

    Several papers have shown ways to do transductive SVMs; for example, the software svmlight has an implementation. In this project, you should re-implement existing methods (by modifying libsvm), conduct experiments, and, if possible, propose new approaches. Note that efficient implementation is not the most important thing here, as we focus on getting some observations; hence you may want to use the Python interface to save implementation time.

    You may want to think about multi-class transductive learning as well.
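
    For a first feeling of the idea, here is a minimal self-training-style
    sketch (train, label the unlabeled points, retrain with the most
    confident of them added). This is not Joachims' TSVM algorithm, and the
    project should implement the published methods; the svmutil-style
    Python interface is assumed:

        from svmutil import svm_train, svm_predict

        def self_train(y_lab, x_lab, x_unl, rounds=5, keep=0.2,
                       params='-t 0 -c 1 -q'):
            model = svm_train(y_lab, x_lab, params)
            for _ in range(rounds):
                # placeholder labels of 0; only the decision values matter
                labels, _, vals = svm_predict([0] * len(x_unl), x_unl, model)
                ranked = sorted(zip(x_unl, labels, vals),
                                key=lambda t: abs(t[2][0]), reverse=True)
                confident = ranked[:int(keep * len(ranked))]
                model = svm_train(y_lab + [l for _, l, _ in confident],
                                  x_lab + [x for x, _, _ in confident],
                                  params)
            return model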

  12. EUNITE 3 competition (revised version)

  13. Range of SVM parameter search

  14. Time series prediction tool

Exam

No exam.
Last modified: Tue Jan 13 14:43:11 CST 2004