The course load is designed under the assumption that you are taking no more than four major courses this semester. Hence, if you are not very interested in this course or do not plan to spend at least one fourth of your coursework time on it this semester, please do not take this course.
To get to the essence of things one has to work long and hard --- Vincent Van Gogh
There are no prerequisites for this course, so anyone (from Ph.D. students to high school students) is welcome, provided you work hard and are enthusiastic about the topic.
R is a powerful statistical environment and we think it is useful for machine learning as well. You are expected to learn it by yourself.
Every week we will randomly select one student to give a 10 to 15-minute presentation about his/her homework. Everyone has to turn in his/her homework before this presentation. Rules: We do not require you to come every week. If you are absent and are selected for presentation, you will be required to give a presentation the following week. If you fail to show up then, 15 points will be deducted from your mid-term exam score. On the other hand, every week we first ask for a volunteer, who will receive 10 bonus points on the mid-term. However, you can volunteer only once in this course. When no one volunteers, anyone can be picked, whether or not you have presented before.
We will discuss the final exam on June 7, 2004 (the day of the final presentations).
Yes, your presentation will be in English.
Possible topics
Roughly speaking, if you have 100 jobs and 10 computers, under the current settings you have to split the 100 jobs into 10 groups first and then assign each group to a computer. Load balancing is therefore poor: a computer that gets a group of long jobs keeps running while the others sit idle. You would like to have a central scheduler that hands out jobs one at a time, each to whichever computer is free.
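To see why this matters, here is a minimal simulation sketch in Python. It is not the actual system: the function names and the job durations are made up for illustration, and it assumes each job's running time is known so that the total finishing time (makespan) of the two strategies can be compared.

```python
import heapq

def static_makespan(durations, n_workers):
    """Split jobs into contiguous groups up front; each worker runs one group.

    Illustrative assumption: groups are equal-sized slices of the job list.
    """
    group_size = -(-len(durations) // n_workers)  # ceiling division
    groups = [durations[i:i + group_size]
              for i in range(0, len(durations), group_size)]
    # The slowest group determines when everything is finished.
    return max(sum(group) for group in groups)

def dynamic_makespan(durations, n_workers):
    """Central scheduler: give the next job to the worker that frees up first."""
    free_at = [0.0] * n_workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    for d in durations:
        t = heapq.heappop(free_at)      # earliest-free worker
        heapq.heappush(free_at, t + d)  # it is busy until t + d
    return max(free_at)

# Hypothetical workload: 100 jobs, 10 computers, the first 10 jobs are slow.
durations = [10] * 10 + [1] * 90
print("static groups:    ", static_makespan(durations, 10))
print("central scheduler:", dynamic_makespan(durations, 10))
```

With this made-up workload, static grouping puts all ten slow jobs on one computer, while the central scheduler spreads them out. A real scheduler would not know the durations in advance; it would simply hand out the next job whenever a computer reports that it is free.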
Basically, you have to read the paper and work out how the author conducted such a huge comparison. Then you have to identify some multiclass data sets and conduct a similar comparison.
A workshop at NIPS 2003 held a feature selection competition. For this project, identify the feature selection schemes available in R and conduct a comparison. You can ask the TA for details of this competition. He participated in it and ranked 3rd and 2nd in two evaluations.
weka is the software used in the textbook. Both weka and yale are written in Java. In my opinion, they sit somewhere between data mining software and machine learning software. That is, they are not as powerful as commercial data mining software at data preprocessing. However, compared with machine learning environments, they already include a good graphical interface and data preprocessing tools.
R comes from a different background: it is a general statistical computing environment. For this project, I would like you to check and list all features of weka and yale, and then find out how to do the same things in R.
In the following paper, the authors compare NN and SVM. They report that NN gives better results than SVM. Since SVM can generally achieve performance similar to NN, we suspect that the authors may not have used SVM correctly. Hence, we are interested in conducting a study.
Basically, you need to obtain their data and then redo the experiments.