FAQ for the paper "Large linear classification when data cannot fit in memory"

We got some interesting comments about this paper, so we decided to write an FAQ. Please feel free to give us more comments.

Q: Why train on data larger than memory when, in general, the accuracy does not improve much?

See the next question.

Q: Why consider data that fits on a single disk when we need to train even larger data in a data center?

See the previous question.

Q: Why are the above two questions opposite to each other?

We don't really have an answer yet, but both points may be valid. In many cases, subsampling the training data does not degrade the prediction accuracy much, so there is no need to employ large-scale training techniques. This usually happens when the data quality is good. However, we also see situations where Internet companies collect huge amounts of web logs and train on them with distributed learning systems.

Currently, some people question the need for large data, while others always think more is better. We believe large-scale training is application dependent: the properties of the target application determine how many data points are needed. This remains an important research issue for machine learning practice.

Q: So it seems that this work is somewhere between the above two viewpoints?

Yes. We hope to bridge the two very different viewpoints mentioned above. So far we haven't had many data sets larger than the memory capacity. If you need to train such large sets, we will be very interested in learning about your applications.

Q: What are other existing classification tools which can handle data larger than memory?

Many papers have proposed methods, though they may not provide tools. One package that has been designed for such situations is VW from Yahoo!. It uses an online algorithm, so its setting is slightly different from our off-line one. If you know of other tools, please let us know.

Q: Both your software and VW need to compress data. Does this mean that when data sets are larger than memory, many implementation considerations for machine learning algorithms are different?

We think so. In the past we didn't worry about system and file issues at all, but we need to consider them for large-scale systems.
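To make the idea concrete, here is a minimal Python sketch of block-wise training over compressed data. It is only an illustration of the general scheme, not the paper's actual implementation: the perceptron-style update, the zlib/pickle compression, and all names below are our illustrative assumptions, and the compressed blocks are kept in memory here for simplicity (on disk in practice). The point is that only one decompressed block needs to reside in memory at a time.

```python
# Illustrative sketch (NOT the paper's algorithm): train a linear
# classifier while keeping the data split into compressed blocks and
# decompressing only one block at a time.
import pickle
import random
import zlib


def compress_block(block):
    """Serialize and compress one block of (features, label) pairs."""
    return zlib.compress(pickle.dumps(block))


def train_from_blocks(compressed_blocks, dim, max_epochs=300, lr=0.1):
    """Perceptron-style training; each pass decompresses blocks one by one,
    so memory holds only the model and a single decompressed block."""
    w = [0.0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for blob in compressed_blocks:
            block = pickle.loads(zlib.decompress(blob))  # one block in memory
            for x, y in block:  # y is +1 or -1
                if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                    for i in range(dim):  # perceptron update on a mistake
                        w[i] += lr * y * x[i]
                    mistakes += 1
        if mistakes == 0:  # a full clean pass: converged
            break
    return w


# Toy linearly separable data (bias feature + one coordinate),
# split into two compressed "disk" blocks.
random.seed(0)
data = [([1.0, x], 1 if x > 0 else -1)
        for x in (random.uniform(-1, 1) for _ in range(200))
        if abs(x) > 0.1]
half = len(data) // 2
blocks = [compress_block(data[:half]), compress_block(data[half:])]

w = train_from_blocks(blocks, dim=2)
errors = sum(1 for x, y in data
             if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0)
print(errors)  # → 0 on this separable toy data
```

The sketch deliberately ignores the questions a real implementation must answer, e.g. how to split and shuffle blocks on disk, which compression scheme balances decompression speed against size, and how many inner optimization steps to spend per loaded block.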

Q: Your implementation supports warm start, but no incremental/decremental learning yet?

It is on our to-do list. If you have any potential applications needing incremental/decremental settings, please contact us, as we would love to learn more.

Q: From the above discussion, it seems you are not sure yet whether this work will eventually be useful, even though it received the KDD 2010 best paper award?

Yes, we don't really know yet. But that is why research is interesting.

Q: So you are very frank in evaluating your research work, unlike many people who always say how great and useful their work is?

We really did receive such a comment at the KDD conference. We always try to be very honest in describing our work.

Please contact Chih-Jen Lin for any question. Last modified: Sat Sep 11 21:03:45 CST 2010