This page contains many classification, regression, and multi-label data sets used in our papers. Many are from UCI, Statlog, StatLib, and other collections. We sincerely thank them for their efforts. For most sets, we directly transform the file into LIBSVM format and linearly scale each attribute to [-1,1]. The testing data (if provided) are adjusted accordingly. Some training data are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set.
A summary of all data sets follows. If you have used LIBSVM with these sets and find them useful, please cite our work as:
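As a rough illustration of the preprocessing described above, the following minimal sketch scales each attribute linearly to [-1,1] and writes instances in LIBSVM's sparse text format (`<label> <index>:<value> ...` with 1-based indices). The function names are hypothetical, and two details are assumptions rather than statements about how these sets were actually produced: the testing data reuse the minimum and maximum found on the training data, and a constant attribute is mapped to 0.

```python
def fit_scaling(rows):
    """Return per-attribute (min, max) computed from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute from [min, max] to [lower, upper]."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            # Constant attribute: emit 0 (an assumption made for this sketch).
            scaled.append(0.0)
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled


def to_libsvm_line(label, row):
    """Format one instance as '<label> <index>:<value> ...'.

    Indices start at 1; zero-valued attributes are omitted,
    which the sparse format allows.
    """
    features = " ".join(
        f"{i}:{v:g}" for i, v in enumerate(row, start=1) if v != 0
    )
    return f"{label} {features}".rstrip()


# Toy data, invented for illustration only.
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
test = [[2.0, 25.0]]

ranges = fit_scaling(train)                         # learned on training data
scaled_train = [scale_row(r, ranges) for r in train]
scaled_test = [scale_row(r, ranges) for r in test]  # same factors reused

print(to_libsvm_line(1, scaled_train[0]))   # 1 1:-1 2:-1
print(to_libsvm_line(-1, scaled_test[0]))   # -1 1:-0.5
```

Reusing the training ranges means a testing value can fall outside [-1,1] when it lies outside the range seen in training.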
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Please also cite the source of the data sets (references given below).
Go to pages of classification (binary, multi-class), regression, and multi-label.
| name | source | type | classes | training size | testing size | features |
|---|---|---|---|---|---|---|
| a1a | UCI | classification | 2 | 1,605 | 30,956 | 123 |
| a2a | UCI | classification | 2 | 2,265 | 30,296 | 123 |
| a3a | UCI | classification | 2 | 3,185 | 29,376 | 123 |
| a4a | UCI | classification | 2 | 4,781 | 27,780 | 123 |
| a5a | UCI | classification | 2 | 6,414 | 26,147 | 123 |
| a6a | UCI | classification | 2 | 11,220 | 21,341 | 123 |
| a7a | UCI | classification | 2 | 16,100 | 16,461 | 123 |
| a8a | UCI | classification | 2 | 22,696 | 9,865 | 123 |
| a9a | UCI | classification | 2 | 32,561 | 16,281 | 123 |
| australian | Statlog | classification | 2 | 690 | | 14 |
| breast-cancer | UCI | classification | 2 | 683 | | 10 |
| colon-cancer | [AU99a] | classification | 2 | 62 | | 2,000 |
| covtype.binary | UCI | classification | 2 | 581,012 | | 54 |
| diabetes | UCI | classification | 2 | 768 | | 8 |
| duke breast-cancer | [MW01a] | classification | 2 | 44 | | 7,129 |
| fourclass | [TKH96a] | classification | 2 | 862 | | 2 |
| german.numer | Statlog | classification | 2 | 1,000 | | 24 |
| heart | Statlog | classification | 2 | 270 | | 13 |
| ijcnn1 | [DP01a] | classification | 2 | 49,990 | 91,701 | 22 |
| ionosphere | UCI | classification | 2 | 351 | | 34 |
| leukemia | [TG99a] | classification | 2 | 38 | 34 | 7,129 |
| liver-disorders | UCI | classification | 2 | 345 | | 6 |
| mushrooms | UCI | classification | 2 | 8,124 | | 112 |
| news20.binary | [SSK05a] | classification | 2 | 19,996 | | 1,355,191 |
| rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236 |
| real-sim | A. McCallum | classification | 2 | 72,309 | | 20,958 |
| splice | Delve | classification | 2 | 1,000 | 2,175 | 60 |
| sonar | UCI | classification | 2 | 208 | | 60 |
| svmguide1 | [CWH03a] | classification | 2 | 3,089 | 4,000 | 4 |
| svmguide3 | [CWH03a] | classification | 2 | 1,243 | 41 | 21 |
| w1a | [JP98a] | classification | 2 | 2,477 | 47,272 | 300 |
| w2a | [JP98a] | classification | 2 | 3,470 | 46,279 | 300 |
| w3a | [JP98a] | classification | 2 | 4,912 | 44,837 | 300 |
| w4a | [JP98a] | classification | 2 | 7,366 | 42,383 | 300 |
| w5a | [JP98a] | classification | 2 | 9,888 | 39,861 | 300 |
| w6a | [JP98a] | classification | 2 | 17,188 | 32,561 | 300 |
| w7a | [JP98a] | classification | 2 | 24,692 | 25,057 | 300 |
| w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300 |
| webspam | Webb Spam Corpus [ST06a] | classification | 2 | 350,000 | | 16,609,143 |
| connect-4 | UCI | classification | 3 | 67,557 | | 126 |
| covtype | UCI | classification | 7 | 581,012 | | 54 |
| dna | Statlog | classification | 3 | 2,000 | 1,186 | 180 |
| glass | UCI | classification | 6 | 214 | | 9 |
| iris | UCI | classification | 3 | 150 | | 4 |
| letter | Statlog | classification | 26 | 15,000 | 5,000 | 16 |
| mnist | [YL98a] | classification | 10 | 60,000 | 10,000 | 780 |
| mnist8m | Invariant SVM [GL07b] | classification | 10 | 8,100,000 | | 784 |
| news20 | [KL95a] | classification | 20 | 15,935 | 3,993 | 62,061 |
| poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10 |
| protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357 |
| satimage | Statlog | classification | 6 | 4,435 | 2,000 | 36 |
| sector | [AM98a] | classification | 105 | 6,412 | 3,207 | 55,197 |
| segment | Statlog | classification | 7 | 2,310 | | 19 |
| shuttle | Statlog | classification | 7 | 43,500 | 14,500 | 9 |
| svmguide2 | [CWH03a] | classification | 3 | 391 | | 20 |
| usps | [JJH94a] | classification | 10 | 7,291 | 2,007 | 256 |
| SensIT Vehicle (acoustic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (seismic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (combined) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 100 |
| vehicle | Statlog | classification | 4 | 846 | | 18 |
| vowel | UCI | classification | 11 | 528 | 462 | 10 |
| wine | UCI | classification | 3 | 178 | | 13 |
| abalone | UCI | regression | | 4,177 | | 8 |
| bodyfat | StatLib | regression | | 252 | | 14 |
| cadata | StatLib | regression | | 20,640 | | 8 |
| cpusmall | Delve | regression | | 8,192 | | 12 |
| housing | UCI | regression | | 506 | | 13 |
| mg | [GWF01a] | regression | | 1,385 | | 6 |
| mpg | UCI | regression | | 392 | | 7 |
| pyrim | UCI | regression | | 74 | | 27 |
| space_ga | StatLib | regression | | 3,107 | | 6 |
| triazines | UCI | regression | | 186 | | 60 |
| mediamill (exp1) | Mediamill | multi-label | 101 | 30,993 | 12,914 | 120 |
| rcv1v2 (topics; subsets) | [DL04b] | multi-label | 101 | 3,000 | 3,000 | 47,236 |
| rcv1v2 (topics; full sets) | [DL04b] | multi-label | 101 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (industries; full sets) | [DL04b] | multi-label | 313 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (regions; full sets) | [DL04b] | multi-label | 228 | 23,149 | 781,265 | 47,236 |
| scene-classification | [MB04a] | multi-label | 6 | 1,211 | 1,196 | 294 |
| siam-competition2007 | SIAM Text Mining Competition 2007 | multi-label | 22 | 21,519 | 7,077 | 30,438 |
| yeast | [AE02a] | multi-label | 14 | 1,500 | 917 | 103 |
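
Since the sets above are distributed in LIBSVM format, a small reader can be handy for inspecting a file without installing anything. The sketch below is a hypothetical helper, not part of LIBSVM. It assumes that each line is `<label> <index>:<value> ...`, that multi-label files separate the labels with commas, and that a line starting directly with a feature has no labels. Please check the description of each data set before relying on these assumptions.

```python
def parse_libsvm_line(line):
    """Parse one line into (labels, {index: value}).

    Assumptions made for this sketch: multi-label lines list labels
    separated by commas (for example '2,5 1:0.25'), and a line whose
    first token contains ':' carries no label at all.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    if ":" in tokens[0]:
        labels, feature_tokens = [], tokens
    else:
        # float() also covers real-valued regression targets.
        labels = [float(x) for x in tokens[0].split(",")]
        feature_tokens = tokens[1:]
    features = {}
    for token in feature_tokens:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)  # indices are 1-based
    return labels, features


# Invented example lines, not taken from any of the files.
print(parse_libsvm_line("-1 3:1 11:1 14:1"))
print(parse_libsvm_line("2,5 1:0.25 7:-1"))
```

Features absent from a line are implicitly zero, which is why a dictionary keyed by index is used here rather than a dense list.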
We have done our best to obtain permission from most of the original sources to distribute these sets. Please respect their respective copyrights when using them.
Author: Rong-En Fan at National Taiwan University. Please contact Chih-Jen Lin with any questions.