LIBSVM Data: Classification, Regression, and Multi-label

This page contains many classification, regression, and multi-label data sets used in our papers. Many are from UCI, Statlog, StatLib, and other collections, and we are grateful for their efforts. For most sets, we directly transform the file into LIBSVM format and linearly scale each attribute to [-1,1]. The testing data (if provided) is adjusted accordingly. Some training sets are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set.
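The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used to prepare these sets. It assumes the usual sparse LIBSVM line layout (`label index:value ...` with omitted attributes treated as 0) and assumes that "adjusted accordingly" means the testing data reuses the ranges found on the training data. The function names `parse_libsvm_line`, `fit_ranges`, and `scale_row` are hypothetical.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def fit_ranges(rows):
    """Collect the (min, max) of every attribute over the training rows.

    Attributes missing from a sparse row are counted as 0 (assumption).
    """
    indices = set()
    for _, feats in rows:
        indices.update(feats)
    ranges = {}
    for i in indices:
        values = [feats.get(i, 0.0) for _, feats in rows]
        ranges[i] = (min(values), max(values))
    return ranges


def scale_row(features, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute from its training range to [lower, upper]."""
    scaled = {}
    for i, (lo, hi) in ranges.items():
        if hi == lo:
            continue  # constant attribute: nothing to scale, so skip it
        x = features.get(i, 0.0)
        value = lower + (upper - lower) * (x - lo) / (hi - lo)
        if value != 0.0:  # keep the row sparse by dropping zeros
            scaled[i] = value
    return scaled


# Made-up example rows, for illustration only.
train = [parse_libsvm_line("+1 1:0 2:10"), parse_libsvm_line("-1 1:4 2:20")]
test = parse_libsvm_line("+1 1:2 2:25")

ranges = fit_ranges(train)
scaled_train = [scale_row(feats, ranges) for _, feats in train]
scaled_test = scale_row(test[1], ranges)  # reuses the training ranges
```

Because the testing row is scaled with the training ranges, its values can fall outside [-1,1]; in this made-up example the second attribute of the testing row maps to 2.0.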

A summary of all data sets follows. If you have used LIBSVM with these sets and find them useful, please cite our work as:
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.


Please also cite the source of the data sets (references given below).

See the separate pages for classification (binary, multi-class), regression, and multi-label data sets.


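| name | source | type | classes | training size | testing size | features |
|---|---|---|---|---|---|---|
| a1a | UCI | classification | 2 | 1,605 | 30,956 | 123 |
| a2a | UCI | classification | 2 | 2,265 | 30,296 | 123 |
| a3a | UCI | classification | 2 | 3,185 | 29,376 | 123 |
| a4a | UCI | classification | 2 | 4,781 | 27,780 | 123 |
| a5a | UCI | classification | 2 | 6,414 | 26,147 | 123 |
| a6a | UCI | classification | 2 | 11,220 | 21,341 | 123 |
| a7a | UCI | classification | 2 | 16,100 | 16,461 | 123 |
| a8a | UCI | classification | 2 | 22,696 | 9,865 | 123 |
| a9a | UCI | classification | 2 | 32,561 | 16,281 | 123 |
| australian | Statlog | classification | 2 | 690 | | 14 |
| breast-cancer | UCI | classification | 2 | 683 | | 10 |
| colon-cancer | [AU99a] | classification | 2 | 62 | | 2,000 |
| covtype.binary | UCI | classification | 2 | 581,012 | | 54 |
| diabetes | UCI | classification | 2 | 768 | | 8 |
| duke breast-cancer | [MW01a] | classification | 2 | 44 | | 7,129 |
| fourclass | [TKH96a] | classification | 2 | 862 | | 2 |
| german.numer | Statlog | classification | 2 | 1,000 | | 24 |
| heart | Statlog | classification | 2 | 270 | | 13 |
| ijcnn1 | [DP01a] | classification | 2 | 49,990 | 91,701 | 22 |
| ionosphere | UCI | classification | 2 | 351 | | 34 |
| leukemia | [TG99a] | classification | 2 | 38 | 34 | 7,129 |
| liver-disorders | UCI | classification | 2 | 345 | | 6 |
| mushrooms | UCI | classification | 2 | 8,124 | | 112 |
| news20.binary | [SSK05a] | classification | 2 | 19,996 | | 1,355,191 |
| rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236 |
| real-sim | A. McCallum | classification | 2 | 72,309 | | 20,958 |
| splice | Delve | classification | 2 | 1,000 | 2,175 | 60 |
| sonar | UCI | classification | 2 | 208 | | 60 |
| svmguide1 | [CWH03a] | classification | 2 | 3,089 | 4,000 | 4 |
| svmguide3 | [CWH03a] | classification | 2 | 1,243 | 41 | 21 |
| w1a | [JP98a] | classification | 2 | 2,477 | 47,272 | 300 |
| w2a | [JP98a] | classification | 2 | 3,470 | 46,279 | 300 |
| w3a | [JP98a] | classification | 2 | 4,912 | 44,837 | 300 |
| w4a | [JP98a] | classification | 2 | 7,366 | 42,383 | 300 |
| w5a | [JP98a] | classification | 2 | 9,888 | 39,861 | 300 |
| w6a | [JP98a] | classification | 2 | 17,188 | 32,561 | 300 |
| w7a | [JP98a] | classification | 2 | 24,692 | 25,057 | 300 |
| w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300 |
| connect-4 | UCI | classification | 3 | 67,557 | | 126 |
| covtype | UCI | classification | 7 | 581,012 | | 54 |
| dna | Statlog | classification | 3 | 2,000 | 1,186 | 180 |
| glass | UCI | classification | 6 | 214 | | 9 |
| iris | UCI | classification | 3 | 150 | | 4 |
| letter | Statlog | classification | 26 | 15,000 | 5,000 | 16 |
| mnist | [YL98a] | classification | 10 | 60,000 | 10,000 | 780 |
| mnist1 | [YL98a] | classification | 10 | 21,000 | 49,000 | 780 |
| news20 | [KL95a] | classification | 20 | 15,935 | 3,993 | 62,061 |
| protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357 |
| satimage | Statlog | classification | 6 | 4,435 | 2,000 | 36 |
| sector | [AM98a] | classification | 105 | 6,412 | 3,207 | 55,197 |
| segment | Statlog | classification | 7 | 2,310 | | 19 |
| shuttle | Statlog | classification | 7 | 43,500 | 14,500 | 9 |
| svmguide2 | [CWH03a] | classification | 3 | 391 | | 20 |
| usps | [JJH94a] | classification | 10 | 7,291 | 2,007 | 256 |
| SensIT Vehicle (acoustic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (seismic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50 |
| SensIT Vehicle (combined) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 100 |
| vehicle | Statlog | classification | 4 | 846 | | 18 |
| vowel | UCI | classification | 11 | 528 | 462 | 10 |
| wine | UCI | classification | 3 | 178 | | 13 |
| abalone | UCI | regression | | 4,177 | | 8 |
| bodyfat | StatLib | regression | | 252 | | 14 |
| cadata | StatLib | regression | | 20,640 | | 8 |
| cpusmall | Delve | regression | | 8,192 | | 12 |
| housing | UCI | regression | | 506 | | 13 |
| mg | [GWF01a] | regression | | 1,385 | | 6 |
| mpg | UCI | regression | | 392 | | 7 |
| pyrim | UCI | regression | | 74 | | 27 |
| space_ga | StatLib | regression | | 3,107 | | 6 |
| triazines | UCI | regression | | 186 | | 60 |
| mediamill (exp1) | Mediamill | multi-label | 101 | 30,993 | 12,914 | 120 |
| rcv1v2 (topics; subsets) | [DL04b] | multi-label | 101 | 3,000 | 3,000 | 47,236 |
| rcv1v2 (topics; full sets) | [DL04b] | multi-label | 101 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (industries; full sets) | [DL04b] | multi-label | 313 | 23,149 | 781,265 | 47,236 |
| rcv1v2 (regions; full sets) | [DL04b] | multi-label | 228 | 23,149 | 781,265 | 47,236 |
| scene-classification | [MB04a] | multi-label | 6 | 1,211 | 1,196 | 294 |
| siam-competition2007 | SIAM Text Mining Competition 2007 | multi-label | 22 | 21,519 | 7,077 | 30,438 |
| yeast | [AE02a] | multi-label | 14 | 1,500 | 917 | 103 |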

We have tried our best to obtain permission from most of the original sources to distribute these sets. Please follow their respective copyright terms when using them.

Author: Rong-En Fan at National Taiwan University. Please contact Chih-Jen Lin with any questions.