LIBSVM Data: Classification (Binary Class)
This page contains many classification, regression, and
multi-label data sets used in our papers. Many
are from UCI, Statlog, StatLib and other collections. We
sincerely thank them for their efforts. For most sets, we directly transform the file
into LIBSVM format and linearly scale each attribute to [-1,1]. The testing data (if provided)
is adjusted accordingly. Some training data are further split
into "training" (tr) and "validation" (val) sets. Details can be
found in the description of each data set.
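The linear scaling step can be sketched as follows. This is only an illustration, not the script used to produce the files: the names `fit_ranges` and `scale` are hypothetical, rows are assumed to be dense lists of floats, and constant attributes are assumed to map to 0. Reusing the training ranges for the testing data is one reading of "adjusted accordingly".

```python
def fit_ranges(rows):
    """Return per-attribute (min, max) pairs computed from dense training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute from its [min, max] range to [lower, upper]."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # Constant attribute: there is no range to scale.
                # Assumption for this sketch: emit 0.0.
                new_row.append(0.0)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]

ranges = fit_ranges(train)           # ranges come from the training data only
train_scaled = scale(train, ranges)
test_scaled = scale(test, ranges)    # test values may fall outside [-1, 1]
```

Under this reading, a testing value that lies outside the range seen in training is mapped outside [-1,1] rather than clipped.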
a1a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
1,605
/ 30,956 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a2a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
2,265
/ 30,296 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a3a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
3,185
/ 29,376 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a4a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
4,781
/ 27,780 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a5a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
6,414
/ 26,147 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a6a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
11,220
/ 21,341 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a7a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
16,100
/ 16,461 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a8a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
22,696
/ 9,865 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a9a
- Source:
UCI
/ Adult
- Preprocessing:
[JP98a]
- # of classes: 2
- # of data:
32,561
/ 16,281 (testing)
- # of features:
123
/ 123 (testing)
- Files:
australian
- Source:
Statlog
/ Australian
- # of classes: 2
- # of data:
690
- # of features:
14
- Files:
breast-cancer
- Source:
UCI
/ Wisconsin Breast Cancer
- Preprocessing:
Note that column 1 of the original data contains the sample ID. Also, 16 instances with missing values are removed.
- # of classes: 2
- # of data:
683
- # of features:
10
- Files:
colon-cancer
- Source:
[AU99a]
- Preprocessing:
Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one.
[SKS03a]
- # of classes: 2
- # of data:
62
- # of features:
2,000
- Files:
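The two-stage normalization described for this set (per instance, then per feature) might look like the sketch below. The function names are hypothetical, and the use of the population variance (dividing by n) is an assumption, since the description does not say which variance estimator was used.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift and scale a sequence to mean zero and (population) variance one."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant sequence: assumption for this sketch, map everything to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize(rows):
    """Standardize each instance (row) first, then each feature (column)."""
    by_instance = [standardize(row) for row in rows]
    by_feature = [standardize(col) for col in zip(*by_instance)]
    # Transpose back so that each inner list is one instance again.
    return [list(row) for row in zip(*by_feature)]


data = [[1.0, 2.0, 6.0], [4.0, 0.0, 2.0], [3.0, 9.0, 3.0]]
normalized = normalize(data)
```

Because the feature-wise pass runs last, only the columns are guaranteed to end with mean zero and variance one; the rows generally are not.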
covtype.binary
- Source:
UCI
/ Covertype
- Preprocessing:
Transform from multiclass into binary class.
[RC02a]
- # of classes: 2
- # of data:
581,012
- # of features:
54
- Files:
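A multiclass-to-binary relabeling of LIBSVM-format lines can be sketched as below. The description does not state which original class is treated as positive, so `positive_class` is a hypothetical parameter, and the +1/-1 output labels are likewise an assumption of this sketch.

```python
def to_binary(line, positive_class):
    """Rewrite the integer label of one LIBSVM-format line as +1 or -1."""
    # A LIBSVM line is "<label> <index>:<value> <index>:<value> ...".
    label, _, features = line.strip().partition(" ")
    new_label = "+1" if int(label) == positive_class else "-1"
    return f"{new_label} {features}".rstrip()


lines = ["2 1:0.5 3:1", "5 2:0.25", "2 4:1"]
binary = [to_binary(line, positive_class=2) for line in lines]
```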
diabetes
- Source:
UCI
/ Pima Indians Diabetes
- # of classes: 2
- # of data:
768
- # of features:
8
- Files:
duke breast-cancer
- Source:
[MW01a]
- Preprocessing:
Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. The original dataset consists of 49 instances. Five are removed since the classification results using immunohistochemistry and protein immunoblotting assay conflicted. Of the remaining, two instances were rejected due to failed array hybridization. The remaining data are further split into training (38) and validation (4) sets.
[SKS03a]
- # of classes: 2
- # of data:
44
- # of features:
7,129
- Files:
fourclass
- Source:
[TKH96a]
- Preprocessing:
Transform to two-class.
- # of classes: 2
- # of data:
862
- # of features:
2
- Files:
german.numer
- Source:
Statlog
/ German
- # of classes: 2
- # of data:
1,000
- # of features:
24
- Files:
heart
- Source:
Statlog
/ Heart
- # of classes: 2
- # of data:
270
- # of features:
13
- Files:
ijcnn1
- Source:
[DP01a]
- Preprocessing:
We use the winner's transformation
[Chang01d]
- # of classes: 2
- # of data:
49,990
/ 91,701 (testing)
- # of features:
22
- Files:
ionosphere
- Source:
UCI
/ Ionosphere
- # of classes: 2
- # of data:
351
- # of features:
34
- Files:
leukemia
- Source:
[TG99a]
- Preprocessing:
Merge training/testing. Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one.
[SKS03a]
- # of classes: 2
- # of data:
38
/ 34 (testing)
- # of features:
7,129
- Files:
liver-disorders
- Source:
UCI
/ Liver-disorders
- # of classes: 2
- # of data:
345
- # of features:
6
- Files:
mushrooms
- Source:
UCI
/ mushrooms
- Preprocessing:
Each nominal attribute is expanded into several binary attributes. The original attribute #12 has missing values and is not used.
- # of classes: 2
- # of data:
8,124
- # of features:
112
- Files:
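Expanding nominal attributes into binary ones, while leaving out an unused attribute, can be sketched as follows. The function name `expand_nominal`, the 0-based `skip` indices, and the sorted ordering of the generated binary attributes are all assumptions of this sketch; the actual attribute ordering in the file may differ.

```python
def expand_nominal(rows, skip=()):
    """One-hot encode nominal columns; `skip` holds 0-based column indices to drop."""
    kept = [j for j in range(len(rows[0])) if j not in skip]
    # One binary attribute per (column, observed value) pair, values in sorted order.
    slots = [(j, v) for j in kept for v in sorted({row[j] for row in rows})]
    return [[1 if row[j] == v else 0 for j, v in slots] for row in rows]


# Toy data: the third column has missing values ("?") and is skipped.
rows = [["x", "s", "?"], ["b", "s", "c"], ["x", "y", "?"]]
encoded = expand_nominal(rows, skip={2})
```

Each kept attribute contributes exactly one 1 per instance, so every encoded row sums to the number of kept attributes.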
news20.binary
- Source:
[SSK05a]
- Preprocessing:
Each instance has unit length.
- # of classes: 2
- # of data:
19,996
- # of features:
1,355,191
- Files:
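Scaling an instance to unit length can be sketched as below, assuming "unit length" means Euclidean (L2) norm 1, which the description does not state explicitly. The sparse `{index: value}` representation and the name `to_unit_length` are hypothetical.

```python
import math


def to_unit_length(features):
    """Scale a sparse feature dict {index: value} to Euclidean norm 1."""
    norm = math.sqrt(sum(v * v for v in features.values()))
    if norm == 0.0:
        # An all-zero instance has no direction; return it unchanged.
        return dict(features)
    return {i: v / norm for i, v in features.items()}


instance = {3: 3.0, 17: 4.0}
unit = to_unit_length(instance)
```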
rcv1.binary
- Source:
[DL04b]
- Preprocessing:
positive: CCAT, ECAT; negative: GCAT, MCAT; instances in both positive and negative classes are removed.
- # of classes: 2
- # of data:
20,242
/ 677,399 (testing)
- # of features:
47,236
- Files:
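The labeling rule above can be sketched as a small function. The name `binary_label` and the representation of a document as a set of category codes are hypothetical; dropping documents that belong to neither group is an assumption of this sketch, since the description only mentions removing those in both.

```python
POSITIVE = {"CCAT", "ECAT"}
NEGATIVE = {"GCAT", "MCAT"}


def binary_label(categories):
    """Return +1, -1, or None (drop) for a document's set of top-level categories."""
    is_pos = bool(POSITIVE & set(categories))
    is_neg = bool(NEGATIVE & set(categories))
    if is_pos == is_neg:
        # In both groups: removed, per the rule above.
        # In neither group: assumption, also dropped since no label applies.
        return None
    return 1 if is_pos else -1


docs = [{"CCAT"}, {"GCAT", "MCAT"}, {"ECAT", "GCAT"}]
labels = [binary_label(d) for d in docs]
```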
real-sim
- Source:
A. McCallum
/ Real vs. Simulated
- Preprocessing:
Preprocessed by Vikas Sindhwani for the SVMlin project.
- # of classes: 2
- # of data:
72,309
- # of features:
20,958
- Files:
splice
- Source:
Delve
/ splice
- # of classes: 2
- # of data:
1,000
/ 2,175 (testing)
- # of features:
60
- Files:
sonar
- Source:
UCI
/ Undocumented/Sonar
- # of classes: 2
- # of data:
208
- # of features:
60
- Files:
svmguide1
- Source:
[CWH03a]
- Preprocessing:
Original data: an astroparticle application from Jan Conrad of Uppsala University, Sweden.
- # of classes: 2
- # of data:
3,089
/ 4,000 (testing)
- # of features:
4
- Files:
svmguide3
- Source:
[CWH03a]
- Preprocessing:
Original data: from someone in Germany working with the car industry.
- # of classes: 2
- # of data:
1,243
/ 41 (testing)
- # of features:
21
- Files:
w1a
- Source:
[JP98a]
- # of classes: 2
- # of data:
2,477
/ 47,272 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w2a
- Source:
[JP98a]
- # of classes: 2
- # of data:
3,470
/ 46,279 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w3a
- Source:
[JP98a]
- # of classes: 2
- # of data:
4,912
/ 44,837 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w4a
- Source:
[JP98a]
- # of classes: 2
- # of data:
7,366
/ 42,383 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w5a
- Source:
[JP98a]
- # of classes: 2
- # of data:
9,888
/ 39,861 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w6a
- Source:
[JP98a]
- # of classes: 2
- # of data:
17,188
/ 32,561 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w7a
- Source:
[JP98a]
- # of classes: 2
- # of data:
24,692
/ 25,057 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w8a
- Source:
[JP98a]
- # of classes: 2
- # of data:
49,749
/ 14,951 (testing)
- # of features:
300
/ 300 (testing)
- Files: