Data Mining and Machine Learning

Final score will be announced here. Approximately 10% will fail. (tentative)
Instructor: Chih-Jen Lin, Room 413, CSIE building.
Your final score
TA: Yi-Wei Chen (email: b88052)
Homework grade
BBS: ptt.cc; data mining board in ntu/csie
Time: Monday 10:20am-1:10pm, Room 107, CSIE building.
Note for this course: this course will be taught in English.
The course load is designed under the assumption that you are taking no more than four major courses this semester. Hence, if you are not very interested in this course, or do not plan to spend at least one fourth of your course time this semester on it, please do not take this course.
To get to the essence of things one has to work long and hard --- Vincent Van Gogh
There are no prerequisites for this course, so anyone (from Ph.D. to high school student) is welcome, provided you work hard and are enthusiastic about the topic.
Textbook: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Ian H. Witten and Eibe Frank.
Lecture slides: downloadable from the publisher's web site
Reference:
- The environment R. This will be the main environment used for the course.
  R is a powerful statistical environment and we think it is useful for machine learning as well. You are expected to learn it by yourself.
- Modern Applied Statistics with S, Venables and Ripley, fourth edition, 2002.
- Machine Learning, Tom Mitchell, McGraw Hill, 1997.
Past course pages: 2001 Fall, 2002 Fall

Course Outline (tentative)

Input: Concepts, Instances, Attributes
Output: Knowledge Representation
Algorithms: The Basic Methods
Credibility: Evaluating What's Been Learned
Implementations: Real Machine Learning Schemes
Engineering the Input and Output

Homework

Once every week. Please write your homework/reports in English.
The score for late homework decreases exponentially.
Please hand in a printed copy of your homework; do not e-mail it to the TA.

Every week we will randomly select one student to give a 10- to 15-minute presentation about his/her homework. Everyone has to turn in his/her homework before this presentation.

Rules: We do not require you to come every week. If you are absent when you are selected for a presentation, you will be required to present the following week. If you fail to show up then, 15 points will be deducted from your mid-term exam score. On the other hand, every week we first ask for a volunteer, who will get 10 bonus points on the mid-term. However, you can volunteer only once in this course. When no one volunteers, anyone can be picked, whether or not you have presented before.

hw1, simple experiments using R (random forest on iris data), due February 23, 2004.
hw2, data transformation: random forest on usps, due March 1, 2004.
hw3, data statistics: random forest on 22features, due March 8, 2004.
hw4, manipulate different types of attributes, due March 15, 2004.
hw5, nearest neighbor on ijcnn1, due March 22, 2004.
hw6, compare 1R and Naive Bayes using CV accuracy, due March 29, 2004
hw7, k-fold CV on nearest neighbor, due April 12, 2004
hw8, nearest neighbor on vehicle data, due April 19, 2004
hw9, random forests and random ??, due April 26, 2004
hw10, rules and association rules in R, due May 3, 2004
hw11, regression, due May 10, 2004
hw12, svm: compare the running time of C++ and R code, due May 17, 2004. Note that svm slides are here
hw13, data scaling, due May 24, 2004
hw14, ROC curve, due May 31, 2004
hw?, svm: large data
hw?, object-oriented CV code
hw?, random forests using your own tree

Exams

Midterm: April 12, 2004
Final: May 31, 2004
We will discuss the final exam on June 7, 2004 (the day of the final project presentations)

Final Project

We will have one final project. Project presentation: June 7. Each group has one or two people and gives a 30-minute presentation.

Yes, your presentation will be in English.

Possible topics

A continuing study on the scaling issue of the vehicle data.
Extend the snow package to have better socket support. Then use it for parallel parameter search for SVM.
Roughly speaking, if you have 100 jobs and 10 computers, under the current settings you have to split the 100 jobs into 10 groups first and then assign each group to a computer. Thus, the load balancing is bad. You would like to have a central scheduler that hands out jobs one at a time, each to whichever computer is currently free.
Redo the experiments of David Meyer but use different problems, in particular multi-class problems. (See the paper in ~cjlin/software/svm/libsvm/papers/davidmeyercomparison.pdf)
Basically, you have to read the paper and understand how he conducted such a large comparison. Then you have to identify some multi-class data sets and conduct a similar comparison.
Compare different feature selection methods in R, using the data sets of the NIPS 2003 competition.
A workshop at NIPS 2003 held a feature selection competition. You should identify the feature selection schemes available in R and conduct a comparison. You can ask the TA for details of this competition. He participated in it and ranked 3rd/2nd in two evaluations.
Comparison of weka, yale, and R
weka is the software used in the textbook. Both weka and yale are written in Java. In my opinion, they fall somewhere between data mining software and machine learning software. That is, they are not as powerful as commercial data mining software at data preprocessing. However, compared to machine learning environments, they already include a good graphical interface and data preprocessing tools.
R comes from a different background: a general statistical computing environment. For this project, I would like you to check and list all features of weka and yale, and then find out how to do them in R.
PCA is a useful technique for dimension reduction. Somehow I feel that it is more often used in pattern recognition applications. Does this mean PCA is not that important for general machine learning benchmark data sets? You should conduct a comparison and report your observations.
This project is for people who are interested in comparing Neural networks and SVM.
In the following paper, the authors compare NN and SVM. They reported that NN gives better results than SVM. Since SVM can in general achieve performance similar to NN, we suspect that the authors may not have used SVM correctly. Hence, we are interested in conducting a study.
Basically, you should obtain their data and then redo the experiments.
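
For the snow project above, the load-balancing remark can be made concrete with a small simulation. This is only a sketch in Python, not snow or R code: the function names and the job durations are hypothetical, chosen to illustrate the case where a few jobs run much longer than the rest. It compares the finishing time when 100 jobs are split into 10 fixed groups up front with the finishing time when a central scheduler gives each job to the first free computer.

```python
import heapq

def static_makespan(durations, n_workers):
    """Split jobs into contiguous groups up front; each worker runs one group."""
    group_size = -(-len(durations) // n_workers)  # ceiling division
    loads = [sum(durations[i:i + group_size])
             for i in range(0, len(durations), group_size)]
    return max(loads)  # everything is done when the slowest worker finishes

def dynamic_makespan(durations, n_workers):
    """Central scheduler: give the next job to whichever worker is free first."""
    free_at = [0.0] * n_workers  # min-heap of times at which each worker is free
    heapq.heapify(free_at)
    for d in durations:
        t = heapq.heappop(free_at)      # earliest free worker
        heapq.heappush(free_at, t + d)  # it is busy until t + d
    return max(free_at)

# Hypothetical workload: the first 10 jobs take 10 time units, the other 90 take 1.
durations = [10.0] * 10 + [1.0] * 90
static = static_makespan(durations, 10)
dynamic = dynamic_makespan(durations, 10)
print(static, dynamic)  # 100.0 19.0
```

With fixed groups, one computer receives all ten long jobs and the others sit idle after 10 time units. With the central scheduler, the long jobs are spread across the computers. The size of the gap depends entirely on the assumed durations; with equal-length jobs the two schemes finish at the same time.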

Grading

30% homework, 30% project, 40% exams. (tentative)

Related Information

KDnuggets: a useful collection of data mining software, books, and much other material.

Last modified: Tue Jun 8 13:42:58 CST 2004