Data Mining and Machine Learning
- Instructor:
Chih-Jen Lin, Room 413, CSIE building.
- TA: Tzu-Kuo Huang (email: b89034)
Homework grade
Final score will be here.
Approximately 10% of students will fail (tentative).
- BBS: ptt.cc; data mining board in ntu/csie
- Time: Monday 10:20am-1:10pm, Room 105, CSIE building.
Usually we have a 20-minute break at around 11:30am.
- Note for this course:
This course will be taught in English.
The course load is designed under the assumption that you are
taking no more than four major courses this semester.
There are no prerequisites for this course, so anyone (from
high school to Ph.D. students) is welcome, provided you work
hard and are enthusiastic about the topic.
To get to the essence of things one has to work long and hard
--- Vincent Van Gogh
- Textbook:
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations (2nd edition), by Ian H. Witten and Eibe Frank, 2005
- Lecture slides: downloadable from the publisher's web site
- SVM slides
- Reference:
- Past course pages: 2005 Winter, 2004 Winter, 2002 Fall
Course Outline (tentative)
- Input: Concepts, Instances, Attributes
- Output: Knowledge Representation
- Algorithms: The Basic Methods
- Credibility: Evaluating What's Been Learned
- Implementations: Real Machine Learning Schemes
- Engineering the Input and Output
Homework
Once every week. Please write your homework/reports in English.
Scores for late homework will decrease exponentially.
Please print out your homework; do not e-mail it to the TA.
Every week at around 12:50pm we randomly select one student to present
his/her homework.
Moreover, you are required to turn in your homework before
the 20-minute break.
Rules: We do not require you to come every week. If you are
absent and are selected, you will be
required to give a presentation the following week. If you fail
to show up then, 15 points will be deducted from your mid-term
exam score. On the other hand, every week we first ask for a
volunteer, who will get 10 bonus points on
the mid-term. However, you can volunteer only once in this course.
When no one volunteers, anyone can be picked, regardless of
whether you have presented homework before.
- hw1, simple experiments using R (decision
trees on iris data), due Feb. 27, 2006.
- hw2, handling nominal attributes, due March 6, 2006.
- hw3, input format: sparse to others, due March 13, 2006.
- hw4, naive Bayes, due March 20, 2006.
- hw5, tree construction, due March 27, 2006.
- hw6, rule construction, due April 10, 2006.
- hw7, regression, due April 17, 2006.
- hw8, kmeans, due May 1, 2006.
- hw9, cross-validation, due May 8, 2006.
- hw10, ROC curve, due May 22, 2006.
- hw11, pruning of C4.5, due June 5, 2006.
- hw12, SVM on USPS, due June 12, 2006.
Exams
Final Project
We will have one final project. Project presentations:
May 22 and June 19.
Each group: a 25-minute presentation.
Please give me your final report (<= 10 pages) by June 16.
Each group has three or four members.
Yes, your presentation will be in English.
Possible project topics:
- NIPS 2003 held a
feature selection competition.
The winner first conducted some preprocessing
and then used Bayesian networks.
We are curious whether their success is due
to the use of Bayesian networks.
In this project you would follow all of the winner's
preprocessing steps and use a different classifier
(e.g., SVM). Such a study would show which
factor contributed to their success.
- In 2005 there was a competition, the
"Evaluating Predictive Uncertainty Challenge".
It includes two classification problems.
Both training and validation labels are available.
In this project you would try different
methods and see whether you can achieve validation
performance as good as that of the participants.
- KDD CUP 1999 is a classification competition.
For this project you would study and analyze its data
and find ways of achieving the winner's performance.
If possible, you want to do better.
Grading
30% homework, 30% project, 40% Exam. (tentative)
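
As a rough illustration of how these tentative weights combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale; the function name and that scale are hypothetical, and the mid-term bonus and deduction from the homework rules are not modeled.

```python
# Tentative weights from the grading line above, in percent.
WEIGHTS = {"homework": 30, "project": 30, "exam": 40}


def weighted_score(homework, project, exam):
    """Combine component scores (assumed to be on a 0-100 scale)."""
    total = (WEIGHTS["homework"] * homework
             + WEIGHTS["project"] * project
             + WEIGHTS["exam"] * exam)
    return total / 100


if __name__ == "__main__":
    # Hypothetical student: 80 on homework, 90 on the project, 70 on exams.
    print(weighted_score(80, 90, 70))
```

Because the weights sum to 100, a student with the same score on every component receives that score overall.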
Related Information
- Kdnuggets:
a useful collection of data mining related software,
books, and many other resources.
Last modified: Sun Jun 4 14:29:07 CST 2006