LIBSVM Data: Classification (Multi-class)
This page contains many classification, regression, and
multi-label data sets used in our papers. Many
are from UCI, Statlog, StatLib and other collections. We
thank the maintainers of these collections for their efforts. For most sets, we directly transform the file
into LIBSVM format and linearly scale each attribute to [-1,1]. The testing data (if provided)
is adjusted accordingly. Some training data are further separated
into "training" (tr) and "validation" (val) sets. Details can be
found in the description of each data set.
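The scaling step described above can be sketched as follows. This is a minimal illustration, not the scripts actually used for this page: the helper names (`fit_ranges`, `scale_row`, `to_libsvm_line`) are hypothetical, the handling of constant attributes is an assumption, and reading "adjusted accordingly" as "scaled with the ranges computed from the training data" is also an assumption.

```python
def fit_ranges(rows):
    """Return per-attribute (min, max) computed from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute to [lower, upper] using the given ranges."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            # Constant attribute: no range to scale over, so emit 0 (an assumption).
            scaled.append(0.0)
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled


def to_libsvm_line(label, row):
    """Format one instance as 'label index:value ...', skipping zero values."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(row, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


# Toy data (not taken from any data set on this page).
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test = [[2.0, 35.0]]

ranges = fit_ranges(train)                            # ranges come from training data only
scaled_train = [scale_row(r, ranges) for r in train]
scaled_test = [scale_row(r, ranges) for r in test]    # reuse the training ranges

for label, row in zip([1, 2, 1], scaled_train):
    print(to_libsvm_line(label, row))
```

Because the test rows reuse the training ranges, a scaled test value can fall outside [-1,1] when the raw value lies outside the range seen in training.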
connect-4
- Source:
UCI
/ Connect-4
- Preprocessing:
We used binary encoding for each feature (o, b, x), so the number of features is 42*3 = 126.
- # of classes: 3
- # of data:
67,557
- # of features:
126
- Files:
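The binary encoding mentioned for connect-4 can be illustrated with a short sketch. Each of the 42 board positions takes one of the values o, b, x and is expanded into three 0/1 indicators, giving 42*3 = 126 features. The function name `encode_board` and the order of the three indicators within each position are assumptions made for illustration.

```python
CATEGORIES = ("o", "b", "x")  # ordering within each position is an assumption


def encode_board(cells):
    """Expand 42 categorical cells into 42 * 3 = 126 binary features."""
    features = []
    for cell in cells:
        # One indicator per category; exactly one of the three is 1.
        features.extend(1 if cell == c else 0 for c in CATEGORIES)
    return features


# Toy board: all blank except the first two positions.
board = ["b"] * 42
board[0] = "x"
board[1] = "o"
encoded = encode_board(board)
print(len(encoded))
```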
covtype
- Source:
UCI
/ Covertype
- # of classes: 7
- # of data:
581,012
- # of features:
54
- Files:
dna
- Source:
Statlog
/ Dna
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 3
- # of data:
2,000
/ 1,186 (testing)
/ 1,400 (tr)
/ 600 (val)
- # of features:
180
- Files:
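The tr/val separation listed for dna (2,000 training instances into 1,400 tr and 600 val) could be reproduced in spirit with a sketch like the one below. The page does not say how instances were assigned to each set, so the seeded random shuffle, the function name `split_tr_val`, and the seed value are all assumptions; the resulting sets will not match the distributed tr and val files.

```python
import random


def split_tr_val(rows, n_tr, seed=0):
    """Shuffle a copy of rows and return (tr, val) with n_tr rows in tr."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_tr], shuffled[n_tr:]


rows = list(range(2000))  # stand-in for the 2,000 dna training instances
tr, val = split_tr_val(rows, n_tr=1400)
print(len(tr), len(val))
```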
glass
- Source:
UCI
/ Glass Identification
- # of classes: 6
- # of data:
214
- # of features:
9
- Files:
iris
- Source:
UCI
/ Iris Plant
- # of classes: 3
- # of data:
150
- # of features:
4
- Files:
letter
- Source:
Statlog
/ Letter
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 26
- # of data:
15,000
/ 5,000 (testing)
/ 10,500 (tr)
/ 4,500 (val)
- # of features:
16
- Files:
mnist
- Source:
[YL98a]
- # of classes: 10
- # of data:
60,000
/ 10,000 (testing)
- # of features:
780
/ 778 (testing)
- Files:
news20
- Source:
[KL95a]
- Preprocessing:
First 80/20 training/testing split. Also see
this page
[JR01a]
- # of classes: 20
- # of data:
15,935
/ 3,993 (testing)
- # of features:
62,061
/ 62,060 (testing)
- Files:
protein
- Source:
[JYW02a]
- # of classes: 3
- # of data:
17,766
/ 6,621 (testing)
/ 14,895 (tr)
/ 2,871 (val)
- # of features:
357
- Files:
satimage
- Source:
Statlog
/ Satimage
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 6
- # of data:
4,435
/ 2,000 (testing)
/ 3,104 (tr)
/ 1,331 (val)
- # of features:
36
- Files:
sector
- Source:
[AM98a]
- Preprocessing:
A 2/1 training/testing split from this page (train1.* and test1.*).
Scaling:
a tf-idf scheme, ln(1+tf) * log_2(#docs / #coll_freq_of_term), computed using all data; each instance is then normalized to unit length
[JR01b]
- # of classes: 105
- # of data:
6,412
/ 3,207 (testing)
- # of features:
55,197
/ 55,197 (testing)
- Files:
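The tf-idf scaling given for sector can be sketched as below. The weight follows the stated formula ln(1+tf) * log_2(#docs / #coll_freq_of_term), followed by normalization to unit length. The function name `tfidf_unit` is hypothetical, unit length is taken to mean unit Euclidean (L2) norm, and the collection frequencies are passed in precomputed because the page does not spell out how #coll_freq_of_term is counted.

```python
import math


def tfidf_unit(doc_tf, coll_freq, n_docs):
    """Weight term counts by ln(1 + tf) * log2(n_docs / coll_freq), then L2-normalize."""
    weights = {
        term: math.log(1 + tf) * math.log2(n_docs / coll_freq[term])
        for term, tf in doc_tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # all-zero vector: nothing to normalize
    return {term: w / norm for term, w in weights.items()}


# Toy values (not taken from the sector data).
doc = {"svm": 3, "kernel": 1}         # term counts within one document
coll_freq = {"svm": 2, "kernel": 4}   # precomputed collection frequencies
vec = tfidf_unit(doc, coll_freq, n_docs=8)
print(vec)
```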
segment
- Source:
Statlog
/ Segment
- # of classes: 7
- # of data:
2,310
- # of features:
19
- Files:
shuttle
- Source:
Statlog
/ Shuttle
- Preprocessing:
Training data is further separated into two sets, tr and val.
[CWH01a]
- # of classes: 7
- # of data:
43,500
/ 14,500 (testing)
/ 30,450 (tr)
/ 13,050 (val)
- # of features:
9
- Files:
svmguide2
- Source:
[CWH03a]
- Preprocessing:
Original data: a bioinformatics application from Simon Fraser University, Canada.
[JLG03a]
- # of classes: 3
- # of data:
391
- # of features:
20
- Files:
usps
- Source:
[JJH94a]
- # of classes: 10
- # of data:
7,291
/ 2,007 (testing)
- # of features:
256
- Files:
SensIT Vehicle (acoustic)
- Source:
Sensit
[MD04a]
- Preprocessing:
Regenerated as in
[MD04a]
- # of classes: 3
- # of data:
78,823
/ 19,705 (testing)
- # of features:
50
- Files:
SensIT Vehicle (seismic)
- Source:
Sensit
[MD04a]
- Preprocessing:
Regenerated as in
[MD04a]
- # of classes: 3
- # of data:
78,823
/ 19,705 (testing)
- # of features:
50
- Files:
SensIT Vehicle (combined)
- Source:
Sensit
[MD04a]
- Preprocessing:
Regenerated as in
[MD04a]
- # of classes: 3
- # of data:
78,823
/ 19,705 (testing)
- # of features:
100
- Files:
vehicle
- Source:
Statlog
/ Vehicle
- # of classes: 4
- # of data:
846
- # of features:
18
- Files:
vowel
- Source:
UCI
/ Vowel
- # of classes: 11
- # of data:
528
/ 462 (testing)
- # of features:
10
- Files:
wine
- Source:
UCI
/ Wine Recognition
- # of classes: 3
- # of data:
178
- # of features:
13
- Files: