Data Mining and Machine Learning: Theory and Practice

Our team from National Taiwan University wins KDD Cup 2010

See the competition results.

Our paper, talk slides at the KDD Cup 2010 workshop, and more complete slides

A brief description of our approach: The 19 students and one non-registered RA were split into seven groups. Six groups expanded features using various binarization and discretization techniques, then trained logistic regression (using LIBLINEAR) on the resulting sparse feature sets. One group condensed the features to fewer than 20 and then applied random forest (using Weka). Initial development was conducted on an internal split of the training data into training and validation sets. We identified some useful feature combinations. For the final submission, each group submitted a few results and the TAs ensembled them by linear regression.
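The feature expansion step can be illustrated with a small sketch. This is not the team's actual code: the function names, the toy data, and the choice of equal-width binning are assumptions made only for illustration. Each categorical value and each numeric bin becomes its own 0/1 indicator, and the rows are written in the sparse "label index:value" text format that LIBLINEAR reads, with 1-based indices in ascending order.

```python
from bisect import bisect_right

def binarize_categorical(values):
    """Give each distinct category its own indicator column.
    Returns (column id per row, number of columns)."""
    categories = sorted(set(values))
    index_of = {c: i for i, c in enumerate(categories)}
    return [index_of[v] for v in values], len(categories)

def discretize_numeric(values, n_bins):
    """Equal-width binning (an assumed choice among many possible schemes).
    Returns (bin id per row, number of bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    cuts = [lo + width * k for k in range(1, n_bins)]  # interior cut points
    return [bisect_right(cuts, v) for v in values], n_bins

def to_sparse_rows(labels, columns):
    """Expand every (ids, size) column into indicators and emit one
    'label index:1 index:1 ...' line per example."""
    rows = []
    for i, label in enumerate(labels):
        offset = 1  # feature indices start at 1
        parts = [str(label)]
        for ids, size in columns:
            parts.append(f"{offset + ids[i]}:1")
            offset += size  # the next column's indicators follow this one's
        rows.append(" ".join(parts))
    return rows

# Made-up toy data: a categorical field and a numeric field.
labels = [1, 0, 1, 0, 1]
units = ["algebra", "geometry", "algebra", "ratio", "geometry"]
times = [0.0, 2.5, 5.0, 7.5, 10.0]

rows = to_sparse_rows(
    labels,
    [binarize_categorical(units), discretize_numeric(times, n_bins=4)],
)
for row in rows:
    print(row)
```

Each row has exactly one active indicator per original field, so the number of features grows while every example stays very sparse, which is the setting linear classifiers such as logistic regression handle well.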


Course Details


Course Outline

While it is possible to learn a variety of classification, clustering, and other mining techniques from lectures or books, applying them efficiently and accurately to real-world data is a completely different story. Very often a painful process of trial and error is needed. Dealing with practical data issues is more an art than a science, so in this course we try to gain experience by tackling real-world problems posed in past or ongoing competitions in the machine learning and data mining community. In particular, we aim to participate in the ACM KDD Cup, which is currently the most prestigious data mining competition. We expect to run this course interactively, so every week students must discuss their findings and the problems they encountered with the lecturers and other classmates.

Course Format (tentative)

More details will be on the wiki. Since this course is the first of its kind, the setting may not be perfect. Your comments and suggestions are welcome. Moreover, your active participation will help to make this course a success.

Exams

No exams

Grading

Grading will be based on your weekly results and presentations.
Last modified: Wed Mar 16 12:08:01 CST 2011