This page contains many classification, regression, and multi-label data sets stored in LIBSVM format. Many come from UCI, Statlog, StatLib, and other collections; we thank their maintainers for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) is scaled accordingly. Some training sets are further split into "training" (tr) and "validation" (val) sets. Details are given in the description of each data set. To read the data in MATLAB, use "libsvmread" from the LIBSVM package.
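For readers not using MATLAB, the following is a minimal Python sketch of the two ideas in the paragraph above: reading a line in LIBSVM format (`label index:value ...`) and linearly scaling each attribute to [-1,1] using ranges taken from the training data only. It is an illustration, not the code used to prepare these sets. The function names and the sample lines are hypothetical, it assumes a single numeric label per line (so it does not cover the multi-label sets), and it treats features omitted from a line as zero.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])  # assumption: one numeric label per line
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


def fit_ranges(rows):
    """Return {index: (min, max)} over the training rows.

    Features missing from a sparse row are counted as 0.
    """
    indices = set()
    for _, features in rows:
        indices.update(features)
    ranges = {}
    for index in sorted(indices):
        values = [features.get(index, 0.0) for _, features in rows]
        ranges[index] = (min(values), max(values))
    return ranges


def scale_row(features, ranges, lower=-1.0, upper=1.0):
    """Map each feature linearly so that the training range becomes [lower, upper]."""
    scaled = {}
    for index, (lo, hi) in ranges.items():
        if hi == lo:
            continue  # constant feature in training data: nothing to scale
        value = features.get(index, 0.0)
        scaled[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
    return scaled


# Hypothetical sample data, for illustration only.
train_lines = ["+1 1:2 2:10", "-1 1:4 2:30", "+1 1:6"]
test_line = "-1 1:8 2:15"

train_rows = [parse_libsvm_line(line) for line in train_lines]
ranges = fit_ranges(train_rows)

# The test row reuses the training ranges rather than computing its own.
test_label, test_features = parse_libsvm_line(test_line)
scaled_test = scale_row(test_features, ranges)
```

Because the test row is scaled with the training ranges, a test value outside the training range (feature 1 in the sample above) lands outside [-1,1]. That is expected when the same transformation is applied to both sets.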
The table below summarizes all data sets. If you have used LIBSVM with these sets and found them useful, please cite our work as:
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Please also cite the source of the data sets (references given below).
Separate pages are available for classification (binary, multi-class), regression, and multi-label data sets.
| name | source | type | classes | training size | testing size | features |
|---|---|---|---|---|---|---|
| a1a | UCI | classification | 2 | 1,605 | 30,956 | 123 |
| a2a | UCI | classification | 2 | 2,265 | 30,296 | 123 |
| a3a | UCI | classification | 2 | 3,185 | 29,376 | 123 |
| a4a | UCI | classification | 2 | 4,781 | 27,780 | 123 |
| a5a | UCI | classification | 2 | 6,414 | 26,147 | 123 |
| a6a | UCI | classification | 2 | 11,220 | 21,341 | 123 |
| a7a | UCI | classification | 2 | 16,100 | 16,461 | 123 |
| a8a | UCI | classification | 2 | 22,696 | 9,865 | 123 |
| a9a | UCI | classification | 2 | 32,561 | 16,281 | 123 |
| australian | Statlog | classification | 2 | 690 | | 14 |
| breast-cancer | UCI | classification | 2 | 683 | | 10 |
| cod-rna | [AVU06a] | classification | 2 | 59,535 | | 8 |
| colon-cancer | [AU99a] | classification | 2 | 62 | | 2,000 |
| covtype.binary | UCI | classification | 2 | 581,012 | | 54 |
| diabetes | UCI | classification | 2 | 768 | | 8 |
| duke breast-cancer | [MW01a] | classification | 2 | 44 | | 7,129 |
| epsilon | PASCAL Challenge 2008 | classification | 2 | 400,000 | 100,000 | 2,000 |
| fourclass | [TKH96a] | classification | 2 | 862 | | 2 |
| german.numer | Statlog | classification | 2 | 1,000 | | 24 |
| gisette | NIPS 2003 Feature Selection Challenge [IG05a] | classification | 2 | 6,000 | 1,000 | 5,000 |
| heart | Statlog | classification | 2 | 270 | | 13 |
| ijcnn1 | [DP01a] | classification | 2 | 49,990 | 91,701 | 22 |
| ionosphere | UCI | classification | 2 | 351 | | 34 |
| kdd2010 (algebra) | KDD CUP 2010 | classification | 2 | 8,407,752 | 510,302 | 20,216,830 |
| kdd2010 (bridge to algebra) | KDD CUP 2010 | classification | 2 | 19,264,097 | 748,401 | 29,890,095 |
| leukemia | [TG99a] | classification | 2 | 38 | 34 | 7,129 |
| liver-disorders | UCI | classification | 2 | 345 | | 6 |
| mushrooms | UCI | classification | 2 | 8,124 | | 112 |
| news20.binary | [SSK05a] | classification | 2 | 19,996 | | 1,355,191 |
| rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236 |
| real-sim | A. McCallum | classification | 2 | 72,309 | | 20,958 |
| splice | Delve | classification | 2 | 1,000 | 2,175 | 60 |
| sonar | UCI | classification | 2 | 208 | | 60 |
| svmguide1 | [CWH03a] | classification | 2 | 3,089 | 4,000 | 4 |
| svmguide3 | [CWH03a] | classification | 2 | 1,243 | 41 | 21 |
| url | [JM09a] | classification | 2 | 2,396,130 | | 3,231,961 |
| w1a | [JP98a] | classification | 2 | 2,477 | 47,272 | 300 |
| w2a | [JP98a] | classification | 2 | 3,470 | 46,279 | 300 |
| w3a | [JP98a] | classification | 2 | 4,912 | 44,837 | 300 |
| w4a | [JP98a] | classification | 2 | 7,366 | 42,383 | 300 |
| w5a | [JP98a] | classification | 2 | 9,888 | 39,861 | 300 |
| w6a | [JP98a] | classification | 2 | 17,188 | 32,561 | 300 |
| w7a | [JP98a] | classification | 2 | 24,692 | 25,057 | 300 |
| w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300 |
| webspam | Webb Spam Corpus [ST06a] | classification | 2 | 350,000 | | 16,609,143 |
| connect-4 | UCI | classification | 3 | 67,557 | | 126 |
| covtype | UCI | classification | 7 | 581,012 | | 54 |
| dna | Statlog | classification | 3 | 2,000 | 1,186 | 180 |
| glass | UCI | classification | 6 | 214 | | 9 |
| iris | UCI | classification | 3 | 150 | | 4 |
| letter | Statlog | classification | 26 | 15,000 | 5,000 | 16 |
| mnist | [YL98a] | classification | 10 | 60,000 | 10,000 | 780 |
| mnist8m | Invariant SVM [GL07b] | classification | 10 | 8,100,000 | | 784 |
| news20 | [KL95a] | classification | 20 | 15,935 | 3,993 | 62,061 |
| pendigits | UCI | classification | 10 | 7,494 | 3,498 | 16 |
| poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10 |
| protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357 |
| rcv1.multiclass | [DL04b] | classification | 53 | 15,564 | 518,571 | 47,236 |
| satimage | Statlog | classification | 6 | 4,435 | 2,000 | 36 |
| sector | [AM98a] | classification | 105 | 6,412 | 3,207 | 55,197 |
| segment | Statlog | classification | 7 | 2,310 | | 19 |
| shuttle | Statlog | classification | 7 | 43,500 | 14,500 | 9 |
| svmguide2 | [CWH03a] | classification | 3 | 391 | | 20 |
| svmguide4 | [CWH03a] | classification | 6 | 300 | 312 | 10 |
| usps | [JJH94a] | classification | 10 | 7,291 | 2,007 | 256 |
| SensIT Vehicle (acoustic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (seismic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (combined) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 100 |
| vehicle | Statlog | classification | 4 | 846 | | 18 |
| vowel | UCI | classification | 11 | 528 | 462 | 10 |
| wine | UCI | classification | 3 | 178 | | 13 |
| abalone | UCI | regression | | 4,177 | | 8 |
| bodyfat | StatLib | regression | | 252 | | 14 |
| cadata | StatLib | regression | | 20,640 | | 8 |
| cpusmall | Delve | regression | | 8,192 | | 12 |
| E2006-log1p | 10-K Corpus | regression | | 16,087 | 3,308 | 4,272,227 |
| E2006-tfidf | 10-K Corpus | regression | | 16,087 | 3,308 | 150,360 |
| eunite2001 | | regression | | 336 | 31 | 16 |
| housing | UCI | regression | | 506 | | 13 |
| mg | [GWF01a] | regression | | 1,385 | | 6 |
| mpg | UCI | regression | | 392 | | 7 |
| pyrim | UCI | regression | | 74 | | 27 |
| space_ga | StatLib | regression | | 3,107 | | 6 |
| triazines | UCI | regression | | 186 | | 60 |
| YearPredictionMSD | UCI | regression | | 463,715 | 51,630 | 90 |
| mediamill (exp1) | Mediamill | multi-label | 101 | 30,993 | 12,914 | 120 |
| rcv1v2 (topics; subsets) | [DL04b] | multi-label | 101 | 3,000 | 3,000 | 47,236 |
| rcv1v2 (topics; full sets) | [DL04b] | multi-label | 101 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (industries; full sets) | [DL04b] | multi-label | 313 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (regions; full sets) | [DL04b] | multi-label | 228 | 23,149 | 781,265 | 47,236 |
| scene-classification | [MB04a] | multi-label | 6 | 1,211 | 1,196 | 294 |
| siam-competition2007 | SIAM Text Mining Competition 2007 | multi-label | 22 | 21,519 | 7,077 | 30,438 |
| yeast | [AE02a] | multi-label | 14 | 1,500 | 917 | 103 |
We have done our best to obtain permission from most of the original sources to distribute these sets. Please respect their copyright terms when using them.
Author: Rong-En Fan at National Taiwan University. Please contact Chih-Jen Lin with any questions.