Statistical Learning Theory


About this course

It will be taught in English. The goal of this course is to describe support vector machines, a technique for classification and regression.

Note that this course is highly research oriented. You want to take this course only if you are interested in doing related research.


Course Outline


Homework

Once every week. Please write your homework/reports in English. The homework report should be less than or equal to 2 pages.
For late homework, the score will be exponentially decreased.

Every week we will randomly select one to give a 10 to 15-minute presentation about his/her homework (in English). Everyone has to turn in his/her homework before this presentation. Rules: We do not require you to come every week. If you are absent and are selected for presentation, you will be required to do a presentation next week. If you failed to show up then, your final score will be deducted by 5 points. On the other hand, every week we seek for a volunteer first who will get 3 bonus points. When no one volunteers, everyone can be picked no matter you have given a presentation before or not.

  1. (due September 23) : Use k-nearest neighborhood to train combined_scale.bz2 and test combined_scale.t.bz2

  2. (due September 30): three problems given in the class

  3. (due October 7): The calculation of E[N]=2d and prove exp(-gamma edit(x,y)) not a valid kernel.

  4. (due October 14): two questions given in the class and prove exp(-gamma edit(x,y)) not a valid kernel.

  5. (due October 21): one question given in the class and prove exp(-gamma edit(x,y)) not a valid kernel.

  6. (due October 28): traing 5,000 samples of HW1 data dn test

  7. (due November 4):

  8. (due November 11):

  9. (due November 18):

  10. (due November 25):

  11. Application paper presentation: Each group has 30 minutes for presentation. The order of papers will be the order of presentations.
    This will give you an idea how SVM is applied in practice

    http://www.csee.usf.edu/%7Ehall/papers/plankton03a.pdf r93922020 r93944009

    http://portal.acm.org/citation.cfm?id=958479 R93922038, R93942036

    http://www.informedia.cs.cmu.edu/documents/acmm02_lin.pdf R93922013,R93922025

    http://www.araa.asn.au/acra/acra2003/papers/42.pdf r92922006 and r93546015

    http://springerlink.metapress.com/app/home/contribution.asp?wasp=4nc6a0wwrn2qyh8lwvtp&referrer=parent&backto=issue,105,169;journal,90,1768;linkingpublicationresults,1:105633,1 r91922113 r92922120

    http://www.ai.univie.ac.at/~elias/publications/kne_ismir04.pdf p92922007 d93922011

    http://www.comp.nus.edu.sg/~leews/publications/p31189-zhang.pdf r93922108, r93922140

    http://www.sls.csail.mit.edu/sls/publications/2004/saenko_icmi_04.pdf r92922054,

  12. - Explain why the following situation happens (this question is from Yu Lan at Tsinghua University)

  13. - Given in the class. Showing that linearly independent data are linearly separable.

    - Using libsvm to train train combined_scale.bz2 and test combined_scale.t.bz2.

    Due to the time limit, we consider only the RBF kernel which is the default mapping function. Hence, you need to conduct model selection for selecting C and gamma. This can be done by using the file grid.py provided in libsvm (under the directory python). It calculates the cross validation accuracy using different C and gamma (range specified by you) and draw a contour. The usage of grid.py is in the README file of the same directory. An libsvm option -m specifies how much memory to use. The default is 40 but you need more memory to save time. For this problem, -m 300 is enough.

    To restrict the search space, you can use choose.py to chose a subset first. After knowing the possible region of parameters, you run the whole set.

    After finding the best parameter, you train the whole training set and then predict the test set.

    Note: the model selection (i.e. cross validation) can be time consuming so you want to do this homework as early as possible.

  14. The RBF kernel as an inner product when n = 2

  15. - Dual of combined loss function.

    - we would like to study the effect of data scaling. In ~cjlin/htdocs/libsvmtools/binary there are some two-class problems in original and scaled formats. You first randomly split each file to training and testing. Conducting grid search of cross validation on the training and then predict the testing. You should observe that for non-scaled data, it is more difficult to locate good parameters and the testing performance is not good. Write a short (2-page) report to explain what you find.

  16. Write a simple SVM classifier using the SMO-type algorithm taught earlier.

  17. - Derive the dual of SVR

    - Consider the smallest four problems in software/svm/regession_data/newsvrdata. They are regression data sets. You would like to randomly split each data to training and testing first. Then conduct cross validation on the training to choose the best parameters. Write a short report (<= 2 pages) to show your results and experience.


Final Projects

This is an important part of this course.

Presentation will be at January 6, 2005 (30 minutes per group). 3-4 persons per group. Each group has to turn in a report (<= 10 pages, no MS word file please) by January 4 12am. As these projects are highly research oriented, please start working on them as early as possible.

On December 3, each group gives a 30-minute presentation about your progress.

Projects last year.

  1. The svmtoy applet in libsvm homepage is a 2D example. Can you extend it to 3D ? The main difficulty is on testing so many points in a 3-D space.

    Last year one group (~/htdocs/courses/slt2003/projects/svmtoy.pdf) has done some preliminary work, but they were unable to speed up the testing speed. Indeed there are some computer graphics techniques to approximate the nonlinear curve in a 3-D space.

    The java code of svm-toy is in the java directory of libsvm package. You can contact the group who did this project last year and obtain their code. The evaluation is whether your code can be really released to the community.

  2. Matlab interface of libsvm has not been updated for a while by its authors. It is interesting to design a good interface. This will help a lot to the community.

    The only judging criterion is whether your code is simple and good enough for release. It must be simple for future maintenance.

  3. A different working set selection for libsvm: This paper shows another way to select two elements in polynomial time (theorem 6.1). It is interesting to compare its performance with libsvm.

    You would like to compare the number of iterations and the computational time under different parameters. However, it is impossible to analyze results in so many parameters. One possibility is to compare

    You can contact b90098 as he may provide you some scripts of doing this comparison.

  4. This paper proposes a method to solve SVM QP problem and an implementation is available in R. We are interested in doing a timing comparison to libsvm (no need to compare iterations as they are very different approaches). As their approach solves problems with different C's at once, you need to think first how to design a fair comparison.

    You may also have to check the timing difference between libsvm and its R interface.

  5. In this paper, the authors propose a method for surface modeling. We are curious how it performs on other objects. For this project, you would like to This is my answer to Bernhard's question about modifying libsvm. It may help you to modify the code.

  6. Why in the this paper, SVR is worse than ANN? Usually we do not see this and it is interesting to see why.

  7. SVM does not perform well on the data set of EUNITE 2, can you think about any way to use SVM and achieve 75% accuracy. The data set is in ~cjlin/software/svm/eunite2. The subdirectory RAW contains original data but norm39.tr and norm39.test are the main training and testing files for consideration

    An earlier study of this data set is in the second part of the thesis at ~cjlin/latex/students/Bo-June.Chen.

  8. This project is for people intereted in parallel processing. Currently libsvm has a great python tool grid.py for parameter selection. It can easily dispatch jobs to different computers. However, for the R interface, the current tune() function can only be run serially. For this project, we would like to construct a parallel version. Some modifications of the snow packages in R may be needed. We hope that similar to grid.py, if some nodes fail, jobs will be returned to the queue and then dispatch to others.

  9. libsvm has quite a few interfaces to other script languages (e.g. R, Matlab, Python, Ruby, and Perl). When running large data sets, some overheads may cause the computational time to be larger than that of directly using c++ binaries. We would like to investigate the behavior of using different interfaces and realize where bottlenecks are. This project is for people who are interested in script languages.

  10. EUNITE 3 competition (revised version)

  11. How alpha seeding affects the grid?

  12. Polynomial kernel and scaling

  13. range of svm parameter search

  14. time series prediction tool

  15. Update libsvm-2.5-w to libsvm-2.6

Exam

No exam.

Grading

60% homework, 40% project. (tentative)
Last modified: Tue Dec 7 20:13:25 CST 2004