Data Mining and Machine Learning

Final score will be announced here. Approximately 10% will fail (tentative).
Instructor: Chih-Jen Lin, Room 413, CSIE building.
TA: Yi-Wei Chen (email: b88052)
Homework grade
BBS: ptt.cc; data mining board in ntu/csie
Time: Monday 10:20am-1:10pm, Room 105, CSIE building.
Note for this course:
This course will be taught in English.
The course load is designed under the assumption that you are taking no more than four major courses this semester.
There are no prerequisites for this course, so anyone (from Ph.D. students to high school students) is welcome, provided you work hard and are enthusiastic about the topic.
To get to the essence of things one has to work long and hard --- Vincent Van Gogh
Textbook: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Ian H. Witten and Eibe Frank.
Lecture slides: downloadable from the publisher's web site
SVM slides are here
Reference:
- The environment R. This will be the main environment used for the course.
  R is a powerful statistical environment and we think it is useful for machine learning as well. You are expected to learn it by yourself.
- Modern Applied Statistics with S, Venables and Ripley, fourth edition, 2002.
- Machine Learning, Tom Mitchell, McGraw Hill, 1997.
Past course pages: 2004 Winter, 2002 Fall

Course Outline (tentative)

Input: Concepts, Instances, Attributes
Output: Knowledge Representation
Algorithms: The Basic Methods
Credibility: Evaluating What's Been Learned
Implementations: Real Machine Learning Schemes
Engineering the Input and Output

Homework

Once every week. Please write your homework/reports in English.
The score for late homework decreases exponentially.
Please print out your homework; do not e-mail it to the TA.

Every week we will randomly select one student to give a 10 to 15-minute presentation about his/her homework. Everyone has to turn in his/her homework before this presentation.

Rules: We do not require you to come every week. If you are absent and are selected for the presentation, you will be required to present the following week. If you fail to show up then, 15 points will be deducted from your midterm exam score. On the other hand, every week we first ask for a volunteer, who will get 10 bonus points on the midterm. However, you can volunteer only once in this course. When no one volunteers, anyone can be picked, regardless of whether he/she has presented a homework before. Moreover, you are required to turn in your homework before the 20-minute break.

hw1, simple experiments using R (1-nearest neighbor on iris data), due March 7, 2005.
hw2, data transformation using R (1-nearest neighbor on usps data), due March 14, 2005.
hw3, data statistics: covtype data set, due March 21, 2005.
hw4, knn cv on covtype data set, due March 28, 2005.
hw5, knn cv on covtype data set, due April 18, 2005.
hw6, 1-R and Naive Bayes on covtype data set, due April 25, 2005.
hw7, random forest on covtype data set, due May 2, 2005.
hw8, covtype data set: multi-class setting, due May 9, 2005.
hw9, regression example: US presidential elections, due May 16, 2005.
hw10, checking probability estimates in an earlier paper, due May 30, 2005.
hw11, ROC curve, due June 6, 2005.
hw12, svm: covtype, due June 13, 2005.
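
Several of the assignments above combine nearest-neighbor classification with cross-validation. The course itself uses R; the following is only a minimal Python sketch of the idea, assuming Euclidean distance, a simple round-robin fold assignment, and made-up toy data in place of the iris, usps, or covtype sets. The function names are hypothetical and not part of any assignment.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    best_label, best_dist = None, math.inf
    for x, y in zip(train_x, train_y):
        dist = math.dist(x, query)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = y, dist
    return best_label


def cross_validation_accuracy(xs, ys, n_folds):
    """Estimate 1-NN accuracy with n-fold cross-validation.

    Instance i is assigned to fold i % n_folds (an assumption made for
    simplicity); each fold is held out once while the remaining folds
    serve as training data.
    """
    correct = 0
    for fold in range(n_folds):
        train_x = [x for i, x in enumerate(xs) if i % n_folds != fold]
        train_y = [y for i, y in enumerate(ys) if i % n_folds != fold]
        for i, (x, y) in enumerate(zip(xs, ys)):
            if i % n_folds == fold:
                if nearest_neighbor_predict(train_x, train_y, x) == y:
                    correct += 1
    return correct / len(xs)


# Made-up toy data: two well-separated clusters in the plane.
toy_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_y = ["a", "a", "a", "b", "b", "b"]

accuracy = cross_validation_accuracy(toy_x, toy_y, n_folds=3)
print(accuracy)
```

On real data the folds would normally be shuffled first, and the attributes scaled before computing distances.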

Exams

Midterm: April 11, 2005
Final: June 13, 2005
We will discuss the final exam on June 20, 2005 (the day for your final presentation)

Final Project

We will have one final project. Project presentation: May 23 and June 20. Each group: 25-minute presentation. Please give me your final report by June 16.

Yes, your presentation will be in English.

Project description: In this file you can find the log of past libsvm downloads. We are interested in mining this file. Some issues to investigate include:

Identify failed downloads. For example, you may see several attempts from the same place in a short period of time. Investigate why they happened.

Based on where failed attempts come from (or other information), can you identify countries that have fast or slow connections to Taiwan?

Any weekly pattern of downloading? Do people work on Sunday afternoon or not?

Hourly pattern vs. countries and time zones. Where do people come from?

Holidays: Christmas vs. Chinese New Year. When do we see fewer downloads? Is this related to the fact that the majority of users are from other countries?

Percentage of crawlers

Identify any courses: why do people from the same institute download the software many times in a short period of time? Were they doing homework using the software?

OS and browser used. Are more and more people using Linux? What is the trend?

Trend of the software. From the download records of the software, is it going up, down, or stable?

What is the difference between those with and without domain names?
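
As a starting point for questions such as the weekly pattern and repeated attempts from the same place, here is a minimal Python sketch (the course itself uses R). The record layout, the sample entries, the 60-second window, and the function names are all assumptions made for illustration; the real log file may be formatted quite differently.

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical records: (timestamp, host). Not taken from the real log.
sample_log = [
    ("2005-03-06 14:05:10", "host-a.example.edu"),  # Sunday
    ("2005-03-06 14:05:40", "host-a.example.edu"),  # same host, 30 s later
    ("2005-03-07 09:30:00", "host-b.example.com"),  # Monday
    ("2005-03-08 22:15:00", "192.0.2.17"),          # Tuesday
    ("2005-03-14 10:00:00", "host-c.example.org"),  # Monday
]


def downloads_per_weekday(records):
    """Count records by day of the week."""
    counts = Counter()
    for timestamp, _host in records:
        when = datetime.strptime(timestamp, TIME_FORMAT)
        counts[WEEKDAY_NAMES[when.weekday()]] += 1
    return counts


def repeated_attempts(records, window_seconds=60):
    """Return hosts with two consecutive requests within the window.

    Such hosts are only candidates for failed downloads; the threshold
    is an arbitrary assumption.
    """
    last_seen = {}
    suspects = set()
    # ISO-style timestamps sort chronologically as plain strings.
    for timestamp, host in sorted(records):
        when = datetime.strptime(timestamp, TIME_FORMAT)
        previous = last_seen.get(host)
        if previous is not None:
            if (when - previous).total_seconds() <= window_seconds:
                suspects.add(host)
        last_seen[host] = when
    return suspects


weekday_counts = downloads_per_weekday(sample_log)
suspect_hosts = repeated_attempts(sample_log)
print(weekday_counts)
print(suspect_hosts)
```

The same grouping idea extends to hours of the day or to countries once each host has been mapped to a location.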