Data Mining and Machine Learning
- Final score will be announced here.
Approximately 10% will fail. (tentative)
- Instructor:
Chih-Jen Lin, Room 413, CSIE building.
- TA: Yi-Wei Chen (email: b88052)
Homework grade
- BBS: ptt.cc; data mining board in ntu/csie
- Time: Monday 10:20am-1:10pm, Room 105, CSIE building.
- Note for this course:
This course will be taught in English.
The course load is designed under
the assumption that you are taking no more than
four major courses this semester.
There are no prerequisites for this course, so anyone (from
Ph.D. students to high school students) is welcome, provided
you work hard and are enthusiastic about the topic.
To get to the essence of things one has to work long and hard
--- Vincent Van Gogh
- Textbook:
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations
by
Ian H. Witten and Eibe Frank.
- Lecture slides: downloadable from the publisher's web site
SVM slides are here
- Reference:
- Past course pages:
2004 Winter, 2002 Fall
Course Outline (tentative)
- Input: Concepts, Instances, Attributes
- Output: Knowledge Representation
- Algorithms: The Basic Methods
- Credibility: Evaluating What's Been Learned
- Implementations: Real Machine Learning Schemes
- Engineering the Input and Output
Homework
Once every week. Please write your homework/reports in English.
The score of late homework will be
decreased exponentially.
Please print out your homework; do not e-mail it to the TA.
Every week we will randomly select one student to give a 10- to
15-minute presentation about his/her homework.
Everyone has to turn in his/her homework before this
presentation.
Rules: We do not require you to come every week. If you are
absent when selected for a presentation, you will be
required to present the following week. If you fail
to show up then, 15 points will be deducted from your
mid-term exam. On the other hand, every week we first ask for a
volunteer, who will receive 10 bonus points on the
mid-term. However, you can volunteer only once in this course.
When no one volunteers, anyone can be picked, regardless
of whether you have presented homework before.
Moreover, you are required to turn in your homework before
the 20-minute break.
- hw1, simple experiments using R (1-nearest neighbor on iris data), due March 7, 2005.
- hw2, data transformation using R (1-nearest neighbor on usps data), due March 14, 2005.
- hw3, data statistics: covtype data set, due March 21, 2005.
- hw4, knn cv on covtype data set, due March 28, 2005.
- hw5, knn cv on covtype data set, due April 18, 2005.
- hw6, 1-R and Naive Bayes on covtype data set, due April 25, 2005.
- hw7, random forest on covtype data set, due May 2, 2005.
- hw8, covtype data set: multi-class setting, due May 9, 2005.
- hw9, regression example: US presidential elections, due May 16, 2005.
- hw10, checking probability estimates in an earlier paper, due May 30, 2005.
- hw11, ROC curve, due June 6, 2005.
- hw12, svm: covtype, due June 13, 2005.
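Several of the assignments above build on 1-nearest-neighbor classification. As a rough illustration only (the homework itself uses R and the iris, usps, and covtype data), the idea can be sketched in Python. The toy points, labels, and function names below are hypothetical and are not taken from any course data set.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    best_label, best_dist = None, math.inf
    for point, label in zip(train_x, train_y):
        dist = math.dist(point, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Classify each point using all the other points and return the fraction correct."""
    correct = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if nearest_neighbor_predict(rest_x, rest_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical toy data: two well-separated groups in the plane.
toy_x = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.8)]
toy_y = ["a", "a", "b", "b"]

predicted = nearest_neighbor_predict(toy_x, toy_y, (1.1, 1.0))
accuracy = leave_one_out_accuracy(toy_x, toy_y)
```

Leave-one-out is used here only because the toy set is tiny; for a data set the size of covtype, k-fold cross validation would be the more practical choice.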
Exams
Final Project
We will have one final project. Project presentation:
May 23 and June 20.
Each group: 25-minute presentation.
Please give me your final report by June 16.
Yes, your presentation will be in English.
Project description:
In this file you can find the log of how
libsvm was downloaded
in the past. We are interested in mining this file.
Some issues to investigate include
- Identify failed downloads. For example,
you may see several attempts from the same place
in a short period of time. Investigate why they happened.
- From the origins of failed attempts (or other
information),
could you identify
countries that have fast/slow connections to Taiwan?
- Any weekly pattern of downloading?
Do people work on Sunday afternoon or not?
- Hourly pattern vs. countries and time zones. Where do
people come from?
- Holidays: Christmas vs. Chinese New Year. When do we see
fewer downloads? Is this related to the majority of users
being from other countries?
- Percentage of crawlers
- Identify any courses: why do people from the same institute download many times in a short period of time? Were they doing homework using the software?
- OS and browser used. Are more and more people using Linux?
What is the trend?
- Trend of the software. From the download
records of the software, is it going up, down, or stable?
- What is the difference between those with and without
domain names?
- zip vs. .tar.gz: is this related to why the same person
clicks twice? Could you determine which format people prefer?
- Which site downloads the page more often than others?
- Could you identify users' areas? For
example, from domain names you
may see some are computer science people, but some are
in medical areas.
- Starting from January 2005 we have a MATLAB interface
to libsvm.
This file
shows information about users downloading the interface.
Do these users also download libsvm or not?
From the log files, could you conclude
how important the MATLAB interface is?
- Why can we not google http://www.cs.nmsu.edu/~ipivkina/cs579/Homework/hw4.html?
- Could we conclude whether MIT or NMSU students work harder?
http://courses.csail.mit.edu/6.869/psets/ps4/ps4.pdf
- Why was there no download during this course?
http://chicago05.mlss.cc/tiki/tiki-read_article.php?articleId=2
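As one possible starting point for the first and third issues above, the sketch below flags hosts with repeated requests in a short window and counts downloads per weekday. It assumes the log has already been parsed into (host, timestamp) pairs; the actual log format, the host names, and the 5-minute window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def repeated_attempts(records, window=timedelta(minutes=5)):
    """Return the hosts with two consecutive requests no more than `window` apart."""
    by_host = defaultdict(list)
    for host, timestamp in records:
        by_host[host].append(timestamp)
    flagged = set()
    for host, times in by_host.items():
        times.sort()
        for earlier, later in zip(times, times[1:]):
            if later - earlier <= window:
                flagged.add(host)
                break
    return flagged


def downloads_per_weekday(records):
    """Count records by the weekday name of their timestamp."""
    return Counter(WEEKDAYS[timestamp.weekday()] for _, timestamp in records)


# Hypothetical parsed records; real entries would come from the log file.
sample = [
    ("host-a.example", datetime(2005, 3, 7, 10, 0)),
    ("host-a.example", datetime(2005, 3, 7, 10, 2)),
    ("host-b.example", datetime(2005, 3, 7, 11, 0)),
    ("host-b.example", datetime(2005, 3, 8, 11, 0)),
]

suspects = repeated_attempts(sample)
weekday_counts = downloads_per_weekday(sample)
```

A repeated request is only a hint of a failed download; the same pattern could also come from a user fetching both archive formats, so flagged hosts still need to be inspected.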
Grading
30% homework, 30% project, 40% Exam. (tentative)
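To make the weighting concrete, here is a small worked example. It assumes each component is scored on a 0-100 scale and ignores the mid-term bonus and deduction rules; the function name and the sample scores are hypothetical, and the weights themselves are tentative.

```python
def final_score(homework, project, exam, weights=(30, 30, 40)):
    """Combine three component scores (assumed 0-100 each) using percent weights."""
    hw_weight, project_weight, exam_weight = weights
    if hw_weight + project_weight + exam_weight != 100:
        raise ValueError("weights must sum to 100")
    total = hw_weight * homework + project_weight * project + exam_weight * exam
    return total / 100


# Hypothetical scores: homework 80, project 90, exam 70.
example = final_score(80, 90, 70)
```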
Related Information
- Kdnuggets:
a useful collection of data mining software,
books, and many other resources.
Last modified: Sat May 28 18:57:02 CST 2005