LIBSVM Data: Classification, Regression, and Multi-label

This page contains many classification, regression, and multi-label data sets stored in LIBSVM format. Many are from UCI, Statlog, StatLib and other collections. We thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) is adjusted accordingly. Some training data are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set. To read data via MATLAB, you can use "libsvmread" in the LIBSVM package.
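The linear scaling mentioned above can be sketched in Python as follows. This is only an illustration, not the procedure actually used to prepare these sets: the helper names (parse_libsvm_line, fit_ranges, scale_rows) and the toy data are hypothetical, and the sketch assumes single-label files in which every row lists every attribute (real LIBSVM files are sparse and omit zero entries). The attribute ranges are taken from the training rows only and then reused for the testing rows, which is one reading of "adjusted accordingly".

```python
def parse_libsvm_line(line):
    """Parse 'label index:value index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])  # assumes one numeric label (not multi-label)
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def fit_ranges(rows):
    """Collect the (min, max) of each attribute over the training rows."""
    ranges = {}
    for _, features in rows:
        for index, value in features.items():
            lo, hi = ranges.get(index, (value, value))
            ranges[index] = (min(lo, value), max(hi, value))
    return ranges


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Map each attribute linearly so its training range becomes [lower, upper]."""
    scaled = []
    for label, features in rows:
        new_features = {}
        for index, value in features.items():
            lo, hi = ranges.get(index, (value, value))
            if hi == lo:
                # Constant in training, or never seen there: drop the attribute.
                continue
            new_features[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
        scaled.append((label, new_features))
    return scaled


# Hypothetical toy data in LIBSVM format.
train_text = ["+1 1:2.0 2:10.0", "-1 1:4.0 2:30.0", "+1 1:6.0 2:20.0"]
test_text = ["-1 1:5.0 2:40.0"]

train_rows = [parse_libsvm_line(line) for line in train_text]
test_rows = [parse_libsvm_line(line) for line in test_text]

ranges = fit_ranges(train_rows)                # ranges come from training data only
train_scaled = scale_rows(train_rows, ranges)
test_scaled = scale_rows(test_rows, ranges)    # same ranges reused for testing data

print(train_scaled)
print(test_scaled)
```

Because the ranges come from the training rows, scaled testing values may fall outside [-1,1], as attribute 2 of the toy testing row does here.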

A summary of all data sets is given below. If you have used LIBSVM with these sets and find them useful, please cite our work as:
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Please also cite the source of the data sets (references given below).

Go to pages of classification (binary, multi-class), regression, and multi-label.


| name | source | type | classes | training size | testing size | features |
|---|---|---|---|---|---|---|
| a1a | UCI | classification | 2 | 1,605 | 30,956 | 123 |
| a2a | UCI | classification | 2 | 2,265 | 30,296 | 123 |
| a3a | UCI | classification | 2 | 3,185 | 29,376 | 123 |
| a4a | UCI | classification | 2 | 4,781 | 27,780 | 123 |
| a5a | UCI | classification | 2 | 6,414 | 26,147 | 123 |
| a6a | UCI | classification | 2 | 11,220 | 21,341 | 123 |
| a7a | UCI | classification | 2 | 16,100 | 16,461 | 123 |
| a8a | UCI | classification | 2 | 22,696 | 9,865 | 123 |
| a9a | UCI | classification | 2 | 32,561 | 16,281 | 123 |
| australian | Statlog | classification | 2 | 690 | | 14 |
| breast-cancer | UCI | classification | 2 | 683 | | 10 |
| cod-rna | [AVU06a] | classification | 2 | 59,535 | | 8 |
| colon-cancer | [AU99a] | classification | 2 | 62 | | 2,000 |
| covtype.binary | UCI | classification | 2 | 581,012 | | 54 |
| diabetes | UCI | classification | 2 | 768 | | 8 |
| duke breast-cancer | [MW01a] | classification | 2 | 44 | | 7,129 |
| epsilon | PASCAL Challenge 2008 | classification | 2 | 400,000 | 100,000 | 2,000 |
| fourclass | [TKH96a] | classification | 2 | 862 | | 2 |
| german.numer | Statlog | classification | 2 | 1,000 | | 24 |
| gisette | NIPS 2003 Feature Selection Challenge [IG05a] | classification | 2 | 6,000 | 1,000 | 5,000 |
| heart | Statlog | classification | 2 | 270 | | 13 |
| ijcnn1 | [DP01a] | classification | 2 | 49,990 | 91,701 | 22 |
| ionosphere | UCI | classification | 2 | 351 | | 34 |
| kdd2010 (algebra) | KDD CUP 2010 | classification | 2 | 8,407,752 | 510,302 | 20,216,830 |
| kdd2010 (bridge to algebra) | KDD CUP 2010 | classification | 2 | 19,264,097 | 748,401 | 29,890,095 |
| leukemia | [TG99a] | classification | 2 | 38 | 34 | 7,129 |
| liver-disorders | UCI | classification | 2 | 345 | | 6 |
| mushrooms | UCI | classification | 2 | 8,124 | | 112 |
| news20.binary | [SSK05a] | classification | 2 | 19,996 | | 1,355,191 |
| rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236 |
| real-sim | A. McCallum | classification | 2 | 72,309 | | 20,958 |
| splice | Delve | classification | 2 | 1,000 | 2,175 | 60 |
| splice-site | [SS10a,AA12a] | classification | 2 | 10,000,000 | 4,627,840 | 11,725,480 |
| sonar | UCI | classification | 2 | 208 | | 60 |
| svmguide1 | [CWH03a] | classification | 2 | 3,089 | 4,000 | 4 |
| svmguide3 | [CWH03a] | classification | 2 | 1,243 | 41 | 21 |
| url | [JM09a] | classification | 2 | 2,396,130 | | 3,231,961 |
| w1a | [JP98a] | classification | 2 | 2,477 | 47,272 | 300 |
| w2a | [JP98a] | classification | 2 | 3,470 | 46,279 | 300 |
| w3a | [JP98a] | classification | 2 | 4,912 | 44,837 | 300 |
| w4a | [JP98a] | classification | 2 | 7,366 | 42,383 | 300 |
| w5a | [JP98a] | classification | 2 | 9,888 | 39,861 | 300 |
| w6a | [JP98a] | classification | 2 | 17,188 | 32,561 | 300 |
| w7a | [JP98a] | classification | 2 | 24,692 | 25,057 | 300 |
| w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300 |
| webspam | Webb Spam Corpus [ST06a] | classification | 2 | 350,000 | | 16,609,143 |
| aloi | aloi [AR14a] | classification | 1,000 | 108,000 | | 128 |
| connect-4 | UCI | classification | 3 | 67,557 | | 126 |
| covtype | UCI | classification | 7 | 581,012 | | 54 |
| dna | Statlog | classification | 3 | 2,000 | 1,186 | 180 |
| glass | UCI | classification | 6 | 214 | | 9 |
| iris | UCI | classification | 3 | 150 | | 4 |
| letter | Statlog | classification | 26 | 15,000 | 5,000 | 16 |
| mnist | [YL98a] | classification | 10 | 60,000 | 10,000 | 780 |
| mnist8m | Invariant SVM [GL07b] | classification | 10 | 8,100,000 | | 784 |
| news20 | [KL95a] | classification | 20 | 15,935 | 3,993 | 62,061 |
| pendigits | UCI | classification | 10 | 7,494 | 3,498 | 16 |
| poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10 |
| protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357 |
| rcv1.multiclass | [DL04b] | classification | 53 | 15,564 | 518,571 | 47,236 |
| satimage | Statlog | classification | 6 | 4,435 | 2,000 | 36 |
| sector | [AM98a] | classification | 105 | 6,412 | 3,207 | 55,197 |
| segment | Statlog | classification | 7 | 2,310 | | 19 |
| shuttle | Statlog | classification | 7 | 43,500 | 14,500 | 9 |
| svmguide2 | [CWH03a] | classification | 3 | 391 | | 20 |
| svmguide4 | [CWH03a] | classification | 6 | 300 | 312 | 10 |
| usps | [JJH94a] | classification | 10 | 7,291 | 2,007 | 256 |
| SensIT Vehicle (acoustic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (seismic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (combined) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 100 |
| vehicle | Statlog | classification | 4 | 846 | | 18 |
| vowel | UCI | classification | 11 | 528 | 462 | 10 |
| wine | UCI | classification | 3 | 178 | | 13 |
| abalone | UCI | regression | | 4,177 | | 8 |
| bodyfat | StatLib | regression | | 252 | | 14 |
| cadata | StatLib | regression | | 20,640 | | 8 |
| cpusmall | Delve | regression | | 8,192 | | 12 |
| E2006-log1p | 10-K Corpus | regression | | 16,087 | 3,308 | 4,272,227 |
| E2006-tfidf | 10-K Corpus | regression | | 16,087 | 3,308 | 150,360 |
| eunite2001 | | regression | | 336 | 31 | 16 |
| housing | UCI | regression | | 506 | | 13 |
| mg | [GWF01a] | regression | | 1,385 | | 6 |
| mpg | UCI | regression | | 392 | | 7 |
| pyrim | UCI | regression | | 74 | | 27 |
| space_ga | StatLib | regression | | 3,107 | | 6 |
| triazines | UCI | regression | | 186 | | 60 |
| YearPredictionMSD | UCI | regression | | 463,715 | 51,630 | 90 |
| mediamill (exp1) | Mediamill | multi-label | 101 | 30,993 | 12,914 | 120 |
| rcv1v2 (topics; subsets) | [DL04b] | multi-label | 101 | 3,000 | 3,000 | 47,236 |
| rcv1v2 (topics; full sets) | [DL04b] | multi-label | 101 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (industries; full sets) | [DL04b] | multi-label | 313 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (regions; full sets) | [DL04b] | multi-label | 228 | 23,149 | 781,265 | 47,236 |
| scene-classification | [MB04a] | multi-label | 6 | 1,211 | 1,196 | 294 |
| siam-competition2007 | SIAM Text Mining Competition 2007 | multi-label | 22 | 21,519 | 7,077 | 30,438 |
| yeast | [AE02a] | multi-label | 14 | 1,500 | 917 | 103 |

We have tried our best to obtain permission from most of the original sources to distribute these sets. Please respect their respective copyrights when using them.

Author: Rong-En Fan at National Taiwan University. Please contact Chih-Jen Lin with any questions.