LIBSVM Data: Classification (Multi-class)
This page contains many classification, regression, and
multi-label data sets stored in LIBSVM format. Many
are from UCI, Statlog, StatLib, and other collections; we
thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided)
is adjusted accordingly. Some training data are further separated
into "training" (tr) and "validation" (val) sets. Details can be
found in the description of each data set. To read data via MATLAB, you can use "libsvmread" in the LIBSVM package.
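As a rough illustration of the scaling described above, the following Python sketch parses lines in the sparse "label index:value ..." layout, learns per-feature ranges from the training rows, and applies the same ranges to the testing rows. The function names and the toy rows are assumptions made for illustration, not part of LIBSVM. The sketch only looks at explicitly stored values, so it may differ from a tool that treats absent features as zeros.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


def fit_ranges(rows):
    """Collect per-feature (min, max) over the given rows (training data only)."""
    ranges = {}
    for _, features in rows:
        for index, value in features.items():
            lo, hi = ranges.get(index, (value, value))
            ranges[index] = (min(lo, value), max(hi, value))
    return ranges


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature to [lower, upper] using the stored ranges."""
    scaled = []
    for label, features in rows:
        new_features = {}
        for index, value in features.items():
            if index not in ranges:
                continue  # feature never seen in the training rows
            lo, hi = ranges[index]
            if hi == lo:
                continue  # constant feature carries no information
            new_features[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
        scaled.append((label, new_features))
    return scaled


# Toy rows, made up for this example.
train = [parse_libsvm_line(s) for s in ["1 1:2.0 2:10.0", "2 1:4.0 2:30.0"]]
test = [parse_libsvm_line("1 1:3.0 2:40.0")]

ranges = fit_ranges(train)               # ranges come from the training data only
train_scaled = scale_rows(train, ranges)
test_scaled = scale_rows(test, ranges)   # test values may fall outside [-1, 1]
```

Because the ranges are taken from the training rows, a testing value beyond the training maximum is mapped outside the target interval rather than clipped.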
connect-4
- Source:
UCI
/ Connect-4
- Preprocessing:
We used binary encoding for each feature (o, b, x), so the number of features is 42*3 = 126.
- # of classes: 3
- # of data:
67,557
- # of features:
126
- Files:
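The binary encoding mentioned in the connect-4 preprocessing can be sketched as follows. This is a minimal illustration, assuming each of the 42 board cells holds one of the symbols o, b, or x and that the three indicator columns follow that order; the actual column order used for the files is not stated here, and `encode_board` is a hypothetical helper name.

```python
SYMBOLS = ("o", "b", "x")  # column order assumed for illustration


def encode_board(cells):
    """One-hot encode 42 board cells into a 42*3 = 126 element 0/1 vector."""
    if len(cells) != 42:
        raise ValueError("expected 42 cells")
    vector = []
    for cell in cells:
        # Three indicator values per cell, exactly one of which is 1
        # when the cell holds a known symbol.
        vector.extend(1 if cell == symbol else 0 for symbol in SYMBOLS)
    return vector


# Toy board: all blank except the first two cells.
board = ["b"] * 42
board[0] = "x"
board[1] = "o"
encoded = encode_board(board)
```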
covtype
- Source:
UCI
/ Covertype
- # of classes: 7
- # of data:
581,012
- # of features:
54
- Files:
dna
- Source:
Statlog
/ Dna
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 3
- # of data:
2,000
/ 1,186 (testing)
/ 1,400 (tr)
/ 600 (val)
- # of features:
180
- Files:
glass
- Source:
UCI
/ Glass Identification
- # of classes: 6
- # of data:
214
- # of features:
9
- Files:
iris
- Source:
UCI
/ Iris Plant
- # of classes: 3
- # of data:
150
- # of features:
4
- Files:
letter
- Source:
Statlog
/ Letter
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 26
- # of data:
15,000
/ 5,000 (testing)
/ 10,500 (tr)
/ 4,500 (val)
- # of features:
16
- Files:
mnist
- Source:
[YL98a]
- # of classes: 10
- # of data:
60,000
/ 10,000 (testing)
- # of features:
780
/ 778 (testing)
- Files:
mnist8m
- Source:
Invariant SVM
[GL07b]
- # of classes: 10
- # of data:
8,100,000
- # of features:
784
- Files:
news20
- Source:
[KL95a]
- Preprocessing:
First 80/20 training/testing split. Also see
this page
[JR01a]
- # of classes: 20
- # of data:
15,935
/ 3,993 (testing)
- # of features:
62,061
/ 62,060 (testing)
- Files:
pendigits
- Source:
UCI
/ Pen-Based Recognition of Handwritten Digits Data Set
- # of classes: 10
- # of data:
7,494
/ 3,498 (testing)
- # of features:
16
- Files:
poker
- Source:
UCI
/ Poker Hand
- # of classes: 10
- # of data:
25,010
/ 1,000,000 (testing)
- # of features:
10
- Files:
protein
- Source:
[JYW02a]
- # of classes: 3
- # of data:
17,766
/ 6,621 (testing)
/ 14,895 (tr)
/ 2,871 (val)
- # of features:
357
- Files:
rcv1.multiclass
- Source:
[DL04b]
- Preprocessing:
First, the label hierarchy is reorganized by mapping the data set to the second level of the RCV1 topic hierarchy. Documents that have labels of the third or fourth level only are mapped to their parent category of the second level. Documents that only have labels of the first level are not mapped onto any category. Second, we remove multi-labelled instances.
[RB08a]
- # of classes: 53
- # of data:
15,564
/ 518,571 (testing)
- # of features:
47,236
- Files:
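One possible reading of the rcv1.multiclass label mapping is sketched below. The hierarchy fragment in `PARENT` is hypothetical and only stands in for the real RCV1 topic tree, and the helper names are made up for this example. The sketch maps every label to its second-level ancestor, ignores first-level labels, and keeps a document only when exactly one second-level category remains.

```python
# Hypothetical hierarchy fragment: child -> parent (None marks a first-level topic).
PARENT = {
    "A": None, "A1": "A", "A11": "A1", "A111": "A11", "A2": "A",
    "B": None, "B1": "B",
}


def level(topic):
    """Depth of a topic in the hierarchy; first-level topics have depth 1."""
    depth = 1
    while PARENT[topic] is not None:
        topic = PARENT[topic]
        depth += 1
    return depth


def second_level_ancestor(topic):
    """Return the second-level ancestor of a topic, or None for a first-level topic."""
    if level(topic) < 2:
        return None
    while level(topic) > 2:
        topic = PARENT[topic]
    return topic


def to_single_label(doc_topics):
    """Return the single second-level category of a document, or None if it is dropped."""
    mapped = {second_level_ancestor(t) for t in doc_topics}
    mapped.discard(None)  # first-level labels are not mapped onto any category
    # Documents with no category or with several categories are removed.
    return next(iter(mapped)) if len(mapped) == 1 else None
```

For example, under this reading a document labelled only with the hypothetical topic "A111" ends up in category "A1", while a document labelled with both "A1" and "B1" is removed as multi-labelled.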
satimage
- Source:
Statlog
/ Satimage
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 6
- # of data:
4,435
/ 2,000 (testing)
/ 3,104 (tr)
/ 1,331 (val)
- # of features:
36
- Files:
sector
- Source:
[AM98a]
- Preprocessing:
The scaled data was used in our KDD 08 paper.
For unknown reasons, we can now only generate something close to it.
The sources are from this page.
We select train-0.tc and test-0.tc from ecoc-svm-data.tar.gz.
A 2/1 training/testing split gives the training and testing sets below.
They are in the original format instead of the LIBSVM format: in each row, the 2nd value gives the class label and subsequent numbers give pairs of feature IDs and values.
We then do a kind of tf-idf transformation, ln(1+tf)*log_2(#docs/#coll_freq_of_term), and normalize each instance to unit length.
[JR01b,SSK08a]
- # of classes: 105
- # of data:
6,412
/ 3,207 (testing)
- # of features:
55,197
/ 55,197 (testing)
- Files:
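The sector preprocessing can be approximated with the sketch below. It makes several assumptions: tokens in the original format are whitespace-separated, feature IDs and values alternate after the class label, and "#coll_freq_of_term" is read as the total count of a term over the whole collection. The function names are hypothetical, and this is not the script used to produce the files.

```python
import math


def parse_original_row(line):
    """Read a row whose 2nd value is the class label, followed by (id, value) pairs."""
    tokens = line.split()
    label = int(tokens[1])
    ids = tokens[2::2]
    values = tokens[3::2]
    return label, {int(i): float(v) for i, v in zip(ids, values)}


def tfidf_unit_rows(docs):
    """Apply ln(1+tf) * log_2(#docs / collection frequency), then unit-normalize.

    docs is a list of {term_id: raw term frequency} dictionaries.
    """
    num_docs = len(docs)
    coll_freq = {}
    for doc in docs:
        for term, tf in doc.items():
            coll_freq[term] = coll_freq.get(term, 0) + tf

    rows = []
    for doc in docs:
        weighted = {
            term: math.log(1 + tf) * math.log2(num_docs / coll_freq[term])
            for term, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        if norm > 0:
            weighted = {term: w / norm for term, w in weighted.items()}
        rows.append(weighted)
    return rows


# Toy collection, made up for this example.
docs = [{1: 1, 2: 1}, {2: 1}, {3: 1}, {1: 1, 3: 1}]
rows = tfidf_unit_rows(docs)
```

After the transformation, every non-empty row has Euclidean length 1, which matches the "normalize each instance to unit length" step.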
segment
- Source:
Statlog
/ Segment
- # of classes: 7
- # of data:
2,310
- # of features:
19
- Files:
shuttle
- Source:
Statlog
/ Shuttle
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 7
- # of data:
43,500
/ 14,500 (testing)
/ 30,450 (tr)
/ 13,050 (val)
- # of features:
9
- Files:
svmguide2
- Source:
[CWH03a]
- Preprocessing:
Original data: a bioinformatics application from Simon Fraser University, Canada.
[JLG03a]
- # of classes: 3
- # of data:
391
- # of features:
20
- Files:
svmguide4
- Source:
[CWH03a]
- Preprocessing:
Original data: an application on traffic light signals from Georges Bonga at University of Applied Sciences, Berlin.
- # of classes: 6
- # of data:
300
/ 312 (testing)
- # of features:
10
- Files:
usps
- Source:
[JJH94a]
- # of classes: 10
- # of data:
7,291
/ 2,007 (testing)
- # of features:
256
- Files:
SensIT Vehicle (acoustic)
- Source:
Sensit
[MD04a]
- Preprocessing:
We regenerate features with the authors' MATLAB
scripts (see Sec. C of Appendix A), then randomly select 10% of the instances
from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise).
The training and testing sets come from a random 80%/20% split of the data.
[MD04a]
- # of classes: 3
- # of data:
78,823
/ 19,705 (testing)
- # of features:
50
- Files:
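The subsampling and splitting steps in the SensIT preprocessing can be sketched as follows. This is only an illustration under stated assumptions: each instance is a (label, features) pair, the noise class is marked with the string "noise", and the seed, function name, and toy counts are made up. It is not the procedure used to create the files.

```python
import random


def subsample_and_split(instances, noise_label="noise", keep_fraction=0.1,
                        train_fraction=0.8, seed=0):
    """Keep a random fraction of the noise class, then split the data at random."""
    rng = random.Random(seed)
    noise = [x for x in instances if x[0] == noise_label]
    others = [x for x in instances if x[0] != noise_label]

    # Randomly keep 10% of the noise instances (by default).
    kept_noise = rng.sample(noise, round(len(noise) * keep_fraction))

    data = others + kept_noise
    rng.shuffle(data)

    # Random 80%/20% training/testing split (by default).
    cut = round(len(data) * train_fraction)
    return data[:cut], data[cut:]


# Toy data: 100 AAV, 100 DW, and 2,000 noise instances with empty features.
instances = ([("AAV", {})] * 100) + ([("DW", {})] * 100) + ([("noise", {})] * 2000)
train_set, test_set = subsample_and_split(instances)
```

With these toy counts, keeping 10% of the noise class leaves 200 noise instances against 100 of each vehicle class, which gives the 1:1:2 proportion.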
SensIT Vehicle (seismic)
- Source:
Sensit
[MD04a]
- Preprocessing:
We regenerate features with the authors' MATLAB
scripts (see Sec. C of Appendix A), then randomly select 10% of the instances
from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise).
The training and testing sets come from a random 80%/20% split of the data.
[MD04a]
- # of classes: 3
- # of data:
78,823
/ 19,705 (testing)
- # of features:
50
- Files:
SensIT Vehicle (combined)
- Source:
Sensit
[MD04a]
- Preprocessing:
We regenerate features with the authors' MATLAB
scripts (see Sec. C of Appendix A), then randomly select 10% of the instances
from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise).
The training and testing sets come from a random 80%/20% split of the data.
[MD04a]
- # of classes: 3
- # of data:
78,823
/ 19,705 (testing)
- # of features:
100
- Files:
vehicle
- Source:
Statlog
/ Vehicle
- # of classes: 4
- # of data:
846
- # of features:
18
- Files:
vowel
- Source:
UCI
/ Vowel
- Preprocessing:
The first 528 instances are used for training and the remaining instances for testing. We scale the training data first and adjust the testing data accordingly.
- # of classes: 11
- # of data:
528
/ 462 (testing)
- # of features:
10
- Files:
wine
- Source:
UCI
/ Wine Recognition
- # of classes: 3
- # of data:
178
- # of features:
13
- Files: