LIBSVM Data: Classification, Regression, and Multi-label

This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets raw materials (e.g., original texts) are also available. These data sets are from UCI, Statlog, StatLib and other collections. We thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) is adjusted accordingly. Some training data are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set. To read data via MATLAB, you can use "libsvmread" in the LIBSVM package.
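For readers not using MATLAB, the following is a minimal Python sketch of reading LIBSVM-format lines and of the linear scaling mentioned above. It assumes the usual sparse text layout, one instance per line as "label index:value index:value ...", and it assumes that "adjusted accordingly" means the testing data reuses the minimum and maximum found on the training data. The function names (parse_libsvm_line, fit_ranges, scale_rows) are hypothetical and are not part of LIBSVM. The sketch only looks at stored values and ignores the implicit zeros of sparse data, so it is an illustration rather than a replacement for the tools shipped with LIBSVM.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, Dict[int, float]]


def parse_libsvm_line(line: str) -> Row:
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    # The label is kept as text, since multi-label sets use comma-separated labels.
    label = parts[0]
    features: Dict[int, float] = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


def fit_ranges(rows: List[Row]) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Collect the per-feature minimum and maximum over the stored values."""
    lo: Dict[int, float] = {}
    hi: Dict[int, float] = {}
    for _, feats in rows:
        for i, v in feats.items():
            lo[i] = min(lo.get(i, v), v)
            hi[i] = max(hi.get(i, v), v)
    return lo, hi


def scale_rows(rows: List[Row], lo: Dict[int, float], hi: Dict[int, float],
               lower: float = -1.0, upper: float = 1.0) -> List[Row]:
    """Linearly map each feature from [lo, hi] to [lower, upper]."""
    scaled: List[Row] = []
    for label, feats in rows:
        new: Dict[int, float] = {}
        for i, v in feats.items():
            if i not in lo or hi[i] == lo[i]:
                # Unseen or constant feature: left unchanged (a choice made in this sketch).
                new[i] = v
            else:
                new[i] = lower + (upper - lower) * (v - lo[i]) / (hi[i] - lo[i])
        scaled.append((label, new))
    return scaled
```

A typical use is to call fit_ranges on the parsed training rows only, then pass the same lo and hi to scale_rows for both the training and the testing rows. Testing values outside the training range then fall outside [lower, upper], which is expected.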

A summary of all data sets is given below. If you have used LIBSVM with these sets and find them useful, please cite our work as:
Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1--27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Please also cite the source of the data sets (references given below).

Go to the pages for classification (binary, multi-class), regression, multi-label, and string. Those interested in hierarchical data with many classes can visit the LSHTC page.

Some sets are large and the connection may fail. On Linux you can use

> wget -t inf URL_address_of_data
to retry infinitely many times. If it still fails, add -c to continue downloading a partially downloaded set. You can also use
> lftp -c 'pget -c URL_address_of_data'
to open several connections and reduce the download time.
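Where wget or lftp is not available, the same retry-and-resume idea can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the function name fetch_with_resume and its parameters are hypothetical, the retry count is finite rather than infinite, and resuming relies on the server honoring an HTTP Range request (status 206). If the server ignores the range, the sketch restarts the file from the beginning. It does not detect a file that is already complete.

```python
import http.client
import os
import time
import urllib.request


def fetch_with_resume(url, dest, max_tries=5, wait_seconds=1.0,
                      opener=urllib.request.urlopen):
    """Download url to dest, retrying on failure and resuming from bytes on disk."""
    for attempt in range(1, max_tries + 1):
        # Bytes already saved by an earlier, interrupted attempt.
        done = os.path.getsize(dest) if os.path.exists(dest) else 0
        request = urllib.request.Request(url)
        if done:
            request.add_header("Range", f"bytes={done}-")
        try:
            with opener(request) as response:
                # Append only when the server confirms a partial response (206);
                # otherwise the body is the full file, so start over.
                mode = "ab" if done and response.status == 206 else "wb"
                with open(dest, mode) as out:
                    while True:
                        chunk = response.read(1 << 16)
                        if not chunk:
                            return dest
                        out.write(chunk)
        except (OSError, http.client.HTTPException):
            if attempt == max_tries:
                raise
            time.sleep(wait_seconds)
```

It would be called as fetch_with_resume(URL_address_of_data, local_file_name), with both arguments supplied by the user. The opener parameter exists only so the function can be exercised without a network connection.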


name | source | type | class | training size | testing size | feature
a1a | UCI | classification | 2 | 1,605 | 30,956 | 123
a2a | UCI | classification | 2 | 2,265 | 30,296 | 123
a3a | UCI | classification | 2 | 3,185 | 29,376 | 123
a4a | UCI | classification | 2 | 4,781 | 27,780 | 123
a5a | UCI | classification | 2 | 6,414 | 26,147 | 123
a6a | UCI | classification | 2 | 11,220 | 21,341 | 123
a7a | UCI | classification | 2 | 16,100 | 16,461 | 123
a8a | UCI | classification | 2 | 22,696 | 9,865 | 123
a9a | UCI | classification | 2 | 32,561 | 16,281 | 123
australian | Statlog | classification | 2 | 690 | - | 14
avazu | Avazu's Click-through Prediction | classification | 2 | 40,428,967 | 4,577,464 | 1,000,000
breast-cancer | UCI | classification | 2 | 683 | - | 10
cod-rna | [AVU06a] | classification | 2 | 59,535 | - | 8
colon-cancer | [AU99a] | classification | 2 | 62 | - | 2,000
covtype.binary | UCI | classification | 2 | 581,012 | - | 54
criteo | Criteo's Display Advertising Challenge | classification | 2 | 45,840,617 | 6,042,135 | 1,000,000
criteo_tb | Criteo's Terabyte Click Logs | classification | 2 | 4,195,197,692 | 178,274,637 | 1,000,000
diabetes | UCI | classification | 2 | 768 | - | 8
duke breast-cancer | [MW01a] | classification | 2 | 44 | - | 7,129
epsilon | PASCAL Challenge 2008 | classification | 2 | 400,000 | 100,000 | 2,000
fourclass | [TKH96a] | classification | 2 | 862 | - | 2
german.numer | Statlog | classification | 2 | 1,000 | - | 24
gisette | NIPS 2003 Feature Selection Challenge [IG05a] | classification | 2 | 6,000 | 1,000 | 5,000
heart | Statlog | classification | 2 | 270 | - | 13
HIGGS | UCI | classification | 2 | 11,000,000 | - | 28
Hyperpartisan News Detection | SemEval-2019 Task 4: Hyperpartisan News Detection | classification | 2 | 516 | 65 | -
ijcnn1 | [DP01a] | classification | 2 | 49,990 | 91,701 | 22
imdb-sentiment | Learning Word Vectors for Sentiment Analysis | classification | 2 | 25,000 | 25,000 | -
ionosphere | UCI | classification | 2 | 351 | - | 34
kdd2010 (algebra) | KDD CUP 2010 | classification | 2 | 8,407,752 | 510,302 | 20,216,830
kdd2010 (bridge to algebra) | KDD CUP 2010 | classification | 2 | 19,264,097 | 748,401 | 29,890,095
kdd2010 raw version (bridge to algebra) | KDD CUP 2010 | classification | 2 | 19,264,097 | 748,401 | 1,163,024
kdd2012 | KDD CUP 2012 | classification | 2 | 149,639,105 | - | 54,686,452
leukemia | [TG99a] | classification | 2 | 38 | 34 | 7,129
liver-disorders | UCI | classification | 2 | 145 | 200 | 5
madelon | NIPS 2003 Feature Selection Challenge [IG05a] | classification | 2 | 2,000 | 600 | 500
mushrooms | UCI | classification | 2 | 8,124 | - | 112
news20.binary | [SSK05a] | classification | 2 | 19,996 | - | 1,355,191
phishing | UCI | classification | 2 | 11,055 | - | 68
rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236
real-sim | A. McCallum | classification | 2 | 72,309 | - | 20,958
skin_nonskin | UCI | classification | 2 | 245,057 | - | 3
splice | Delve | classification | 2 | 1,000 | 2,175 | 60
splice-site | [SS10a,AA12a] | classification | 2 | 50,000,000 | 4,627,840 | 11,725,480
sonar | UCI | classification | 2 | 208 | - | 60
SUSY | UCI | classification | 2 | 5,000,000 | - | 18
svmguide1 | [CWH03a] | classification | 2 | 3,089 | 4,000 | 4
svmguide3 | [CWH03a] | classification | 2 | 1,243 | 41 | 21
url | [JM09a] | classification | 2 | 2,396,130 | - | 3,231,961
w1a | [JP98a] | classification | 2 | 2,477 | 47,272 | 300
w2a | [JP98a] | classification | 2 | 3,470 | 46,279 | 300
w3a | [JP98a] | classification | 2 | 4,912 | 44,837 | 300
w4a | [JP98a] | classification | 2 | 7,366 | 42,383 | 300
w5a | [JP98a] | classification | 2 | 9,888 | 39,861 | 300
w6a | [JP98a] | classification | 2 | 17,188 | 32,561 | 300
w7a | [JP98a] | classification | 2 | 24,692 | 25,057 | 300
w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300
webspam | Webb Spam Corpus [ST06a] | classification | 2 | 350,000 | - | 16,609,143
aloi | aloi [AR14a] | classification | 1,000 | 108,000 | - | 128
cifar10 | The CIFAR-10 dataset [AK09a] | classification | 10 | 50,000 | 10,000 | 3,072
connect-4 | UCI | classification | 3 | 67,557 | - | 126
covtype | UCI | classification | 7 | 581,012 | - | 54
dna | Statlog | classification | 3 | 2,000 | 1,186 | 180
glass | UCI | classification | 6 | 214 | - | 9
imdb-rating | Jointly Modelling Aspects, Ratings and Sentiments for Movie Recommendation | classification | 10 | 348,415 | - | -
iris | UCI | classification | 3 | 150 | - | 4
LEDGAR (LexGLUE) | [IC22b] | classification | 100 | 60,000 | 10,000 | 19,996
letter | Statlog | classification | 26 | 15,000 | 5,000 | 16
mnist | [YL98a] | classification | 10 | 60,000 | 10,000 | 780
mnist8m | Invariant SVM [GL07b] | classification | 10 | 8,100,000 | - | 784
news20 | [KL95a] | classification | 20 | 15,935 | 3,993 | 62,061
news20 (18,846) | [KL95a] | classification | 20 | 9,051 | 7,532 | 130,107
pendigits | UCI | classification | 10 | 7,494 | 3,498 | 16
poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10
protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357
rcv1.multiclass | [DL04b] | classification | 53 | 15,564 | 518,571 | 47,236
SCOTUS (LexGLUE) | [IC22b] | classification | 13 | 5,000 | 1,400 | 126,405
satimage | Statlog | classification | 6 | 4,435 | 2,000 | 36
sector | [AM98a] | classification | 105 | 6,412 | 3,207 | 55,197
segment | Statlog | classification | 7 | 2,310 | - | 19
Sensorless | UCI | classification | 11 | 58,509 | - | 48
shuttle | Statlog | classification | 7 | 43,500 | 14,500 | 9
smallNORB | The Small NORB Dataset [YL04b] | classification | 5 | 24,300 | 24,300 | 18,432
SVHN | SVHN [YN11a] | classification | 10 | 73,257 | 26,032 | 3,072
svmguide2 | [CWH03a] | classification | 3 | 391 | - | 20
svmguide4 | [CWH03a] | classification | 6 | 300 | 312 | 10
usps | [JJH94a] | classification | 10 | 7,291 | 2,007 | 256
SensIT Vehicle (acoustic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50
SensIT Vehicle (seismic) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 50
SensIT Vehicle (combined) | Sensit [MD04a] | classification | 3 | 78,823 | 19,705 | 100
vehicle | Statlog | classification | 4 | 846 | - | 18
vowel | UCI | classification | 11 | 528 | 462 | 10
wine | UCI | classification | 3 | 178 | - | 13
abalone | UCI | regression | - | 4,177 | - | 8
bodyfat | StatLib | regression | - | 252 | - | 14
cadata | StatLib | regression | - | 20,640 | - | 8
cpusmall | Delve | regression | - | 8,192 | - | 12
E2006-log1p | 10-K Corpus | regression | - | 16,087 | 3,308 | 4,272,227
E2006-tfidf | 10-K Corpus | regression | - | 16,087 | 3,308 | 150,360
eunite2001 | - | regression | - | 336 | 31 | 16
housing | UCI | regression | - | 506 | - | 13
mg | [GWF01a] | regression | - | 1,385 | - | 6
mpg | UCI | regression | - | 392 | - | 7
pyrim | UCI | regression | - | 74 | - | 27
space_ga | StatLib | regression | - | 3,107 | - | 6
triazines | UCI | regression | - | 186 | - | 60
YearPredictionMSD | UCI | regression | - | 463,715 | 51,630 | 90
Amazon-670K | [JM13a] | multi-label | 670,091 | 490,449 | 153,025 | 135,909 (ver1) 135,909 (ver2)
AmazonCat-13K | [JM13a] | multi-label | 13,330 | 1,186,239 | 306,782 | 203,882 (ver1) 1,293,747 (ver2) 203,882 (ver3)
bibtex | [GT08a] | multi-label | 159 | 7,395 | - | 1,836
BlogCatalog | [LT09a] | multi-label | 39 | 10,312 | - | 128
delicious | [GT08a] | multi-label | 983 | 16,105 | - | 500
ECtHR (A) (LexGLUE) | [IC22b] | multi-label | 10 | 9,000 | 1,000 | 92,401
ECtHR (B) (LexGLUE) | [IC22b] | multi-label | 10 | 9,000 | 1,000 | 92,401
EUR-LEX (LexGLUE) | [IC22b] | multi-label | 100 | 55,000 | 5,000 | 147,464
EUR-Lex | [LM10a] | multi-label | 3,956 | 15,449 | 3,865 | 186,104
EURLEX57K | [IC19a] | multi-label | 4,271 | 45,000 | 6,000 | N/A
Flickr | [LT09a] | multi-label | 195 | 80,513 | - | 128
mediamill (exp1) | Mediamill | multi-label | 101 | 30,993 | 12,914 | 120
PPI | [WLH17a] | multi-label | 121 | 54,958 | - | 128
rcv1v2 (topics; subsets) | [DL04b] | multi-label | 101 | 3,000 | 3,000 | 47,236
rcv1v2 (topics; full sets) | [DL04b] | multi-label | 101 | 23,149 | 781,265 | 47,236
rcv1v2 (industries; full sets) | [DL04b] | multi-label | 313 | 23,149 | 781,265 | 47,236
rcv1v2 (regions; full sets) | [DL04b] | multi-label | 228 | 23,149 | 781,265 | 47,236
scene-classification | [MB04a] | multi-label | 6 | 1,211 | 1,196 | 294
siam-competition2007 | SIAM Text Mining Competition 2007 | multi-label | 22 | 21,519 | 7,077 | 30,438
UNFAIR-ToS (LexGLUE) | [IC22b] | multi-label | 8 | 5,532 | 1,607 | 6,290
Wiki10-31K | [AZ09a] | multi-label | 30,938 | 14,146 | 6,616 | 104,374
yeast | [AE02a] | multi-label | 14 | 1,500 | 917 | 103
mnist (string format) | [SL96a] | string | - | 60,000 | 10,000 | string
YouTube | [LT09a] | multi-label | 46 | 31,703 | - | 128

We have tried our best to obtain permission from most original sources for distributing these sets. Please follow their respective copyrights when using them.

Author: Rong-En Fan at National Taiwan University. Please contact Chih-Jen Lin with any questions.