Data Mining and Machine Learning: Theory and Practice 2011

 

Instructors: Prof. Shou-de Lin
                    Prof. Chih-jen Lin
                    Prof. Hsuan-tien Lin

Classroom: CSIE 105

Meeting Time: Wed 9:10am~12:00

Office Hours: After class or by appointment

TAs: Tim Kuo <d97944007@csie.ntu.edu.tw>, Todd G. McKenzie <d97041@csie.ntu.edu.tw>, Chen-Tse Tsai <ctse.tsai@gmail.com>

Course Description:

While it is possible to learn a variety of machine learning and data mining theories from lectures or books, applying them accurately and efficiently to real-world data is a completely different story. Very often, data miners have to go through a painful process of trial and error because they lack experience. Dealing with the practical issues of data is more of an art than a science. Nevertheless, in this course we try to build up our experience by tackling a real-world problem posed as an ongoing competition in the data mining community. In particular, we aim to participate in the ACM KDD CUP 2011, which is currently the most prestigious data mining competition. We expect to run this course in an interactive way, so every week students must discuss their findings, as well as the problems they encountered, with the lecturers and their classmates.

Prerequisite courses:
You must have taken at least one of the following courses (two or more is even better):
    Machine Learning 
    Statistical Artificial Intelligence
    Optimization and Machine Learning

Course Format and Workload:
You need to implement different kinds of intelligent systems for the competition and run extensive experiments to verify them. You will compete with the other students in the class as well as with other teams from all over the world in KDD CUP. Note that this is an extremely intensive course. Students will give a WEEKLY presentation on their progress in the previous week. Since the estimated time spent on this course is at least 10 hours per week, graduate students generally need approval from their advisor to take it.

Grades:
Your grade will depend on your weekly performance (judged by your effort, novelty, and presentation), weighted by how much you contribute to the overall competition results.

Syllabus:

This course runs from Nov 30, 2010 to June 30, 2011 (you need to commit until June 30 if you want to participate in this course). If you are interested in taking this course, you need to send an email to the instructor (sdlin@csie.ntu.edu.tw) as soon as possible.

Date    Topics                                                            Notes
30-Nov  Course Description & Yahoo Music Data Description
14-Dec  Overview of Recommendation Systems
28-Dec  Netflix Winners' Reports
18-Jan  Model-based CF approach
25-Jan  Random-walk based CF approach
15-Feb  Optimization for CF
23-Feb  First Class (overview of the class and what we have done so far)  Working on Yahoo music dataset
2-Mar   TBD afterwards                                                    Working on Yahoo music dataset
9-Mar                                                                     Working on Yahoo music dataset
16-Mar                                                                    3/15: competition begins
23-Mar                                                                    Working on KDDCUP 2011 dataset
30-Mar
6-Apr
13-Apr
20-Apr
27-Apr
4-May
11-May
18-May
25-May
1-Jun
8-Jun
15-Jun
22-Jun
30-Jun                                                                    Competition Ends

 

FAQ (modified from last year's FAQ):

Q: I am interested in learning data mining and machine learning methods. Is this course the place to go?

This course aims to participate in data mining competitions (specifically, KDD CUP), so it is not the place to learn the basic material of machine learning and data mining. We assume you already know the basics. Therefore, we require participants to have taken some preliminary courses (see above).


Q: What is the capacity of this class? Do we work individually or form teams?

To make sure we provide sufficient support to every student in the class, we plan to take no more than 25 students. If more than 25 students express interest in joining, we will select based on their prerequisite knowledge and intention. Students form teams of 3 people each.


Q: May I audit this course?

In general the answer is no, because you will not learn a lot without getting your hands dirty in this class. We don't want to waste your time, and we hope every member of the class really does spend a significant amount of effort on the competition.


Q: How about the course load?

Please anticipate spending at least 10 hours per week on this course. Simply put: the more effort you put in, the better results you will get. When your fellow classmates spend (or have to spend) lots of time and effort on this, you will not be competitive if you don't.


Q: Where can I find details of this course?

We have a homepage (the page you are reading now). However, the course wiki will be the main place for details. You will see our progress on the competition there. Every enrolled student will get a wiki account.


Q: Is there any homework?

There is one single homework assignment (that is, the music recommendation problem in KDDCUP 2011) throughout the whole course. You need to give a 20-minute presentation on your progress EVERY WEEK.


Q: Since we work in teams, can I rely on some smart teammates?

No, you should work as hard as others. We will find a way to evaluate each individual student's performance.


Q: I tried many new approaches in the past week, but all gave worse results. What should I present?

Failed approaches do show something. You should frankly present what you have tried. Competition results are related to your final score, but they do not completely determine it. We encourage creative thinking and out-of-the-box ideas. Novel ideas will be rewarded even if you have not proven them to be useful.


Q: What kinds of computational resources do I need for this course? Will you provide any?

In general the department's machines (e.g., 217) should be enough. We will also provide some machines we purchased for this competition.


Q: Will each team/student submit results to KDD Cup 2011?

It depends on many factors, and the instructors will decide on the best submission strategy when the time is closer. It is possible that we will only allow selected teams to submit results and/or form a new ensemble of teams for submission. In any case, every team's contribution (with ideas and either positive or negative results) will be fairly acknowledged. At this point, the policy is that no individual or team may submit results to KDD Cup 2011 unless permission is granted by the instructors in advance. Violating the policy will lead to serious punishment.


Q: Is it possible that I fail this course?

Of course. You pass only if you work hard enough. (Similarly, in industry, underperformers will be fired.)


Q: How well did your teams do in past years?

Well, we were not perfect, of course, but we did OK.
This year's performance will be considered satisfactory if it is similar to last year's.