LIBSVM Data: Classification (Binary Class)
This page contains many classification, regression, and
multi-label data sets used in our papers. Many
are from UCI, Statlog, StatLib and other collections; we
thank them for their efforts. For most sets, we directly transform the file
into LIBSVM format and linearly scale each attribute to [-1,1]. The testing data (if provided)
are adjusted accordingly. Some training data are further split
into "training" (tr) and "validation" (val) sets. Details can be
found in the description of each data set.
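As a rough illustration of the scaling described above, the following sketch maps each attribute linearly to [-1,1] using ranges taken from the training data, reuses those ranges for the testing data, and prints instances in the sparse "label index:value" LIBSVM format. The function names and the toy data are hypothetical, and mapping a constant attribute to 0 is an assumption made here, not something stated on this page.

```python
def fit_ranges(rows):
    """Return per-attribute (min, max) computed from the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Linearly map each value into [lower, upper] using the given ranges."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            scaled.append(0.0)  # assumption: a constant attribute becomes 0
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled

def to_libsvm_line(label, row):
    """Format one instance as 'label index:value ...' (1-based, zeros omitted)."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(row, start=1) if v != 0]
    return " ".join([str(label)] + pairs)

# Toy data (hypothetical): three training instances, one testing instance.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]

ranges = fit_ranges(train)                      # ranges come from training data only
train_scaled = [scale_row(r, ranges) for r in train]
test_scaled = [scale_row(r, ranges) for r in test]

for label, row in zip([1, -1, 1], train_scaled):
    print(to_libsvm_line(label, row))
```

Because the testing data reuse the training ranges, a testing value outside the training range can fall outside [-1,1], as the second attribute of the toy testing instance does.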
a1a
- Source:
UCI
/ Adult
- Preprocessing:
The original Adult data set has 14 features, among which six
are continuous and eight are categorical. In this data set,
continuous features are discretized into quantiles, and
each quantile is represented by a binary feature.
Also, a categorical feature with m categories is converted to m binary features.
Details on how each feature is converted can be found at the beginning of each file
from this page.
[JP98a]
- # of classes: 2
- # of data:
1,605
/ 30,956 (testing)
- # of features:
123
/ 123 (testing)
- Files:
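The a1a preprocessing above (quantile discretization of continuous features, one binary feature per quantile or category) can be sketched as follows. This is only an illustration: the number of quantiles, the cut-point rule, and the example values and category names are assumptions, and the exact per-feature conversion is the one documented in the original files.

```python
def quantile_edges(values, n_bins):
    """Return n_bins - 1 cut points splitting the sorted values into roughly equal groups."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]

def bin_index(value, edges):
    """Index of the quantile bin that the value falls into."""
    return sum(value >= e for e in edges)

def one_hot(index, size):
    """Binary vector of the given size with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

# Hypothetical continuous feature and categorical feature.
ages = [19, 23, 31, 38, 45, 52, 60, 71]
workclasses = ["Private", "Self-emp", "Gov"]

edges = quantile_edges(ages, 4)  # 4 quantiles is an assumed choice

def encode(age, workclass):
    """Concatenate the binary encodings of one continuous and one categorical value."""
    continuous_part = one_hot(bin_index(age, edges), len(edges) + 1)
    categorical_part = one_hot(workclasses.index(workclass), len(workclasses))
    return continuous_part + categorical_part

print(encode(38, "Gov"))
```

With this encoding every instance has exactly one active binary feature per original attribute, which is why the resulting data are sparse.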
a2a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
2,265
/ 30,296 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a3a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
3,185
/ 29,376 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a4a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
4,781
/ 27,780 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a5a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
6,414
/ 26,147 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a6a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
11,220
/ 21,341 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a7a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
16,100
/ 16,461 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a8a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
22,696
/ 9,865 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a9a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
32,561
/ 16,281 (testing)
- # of features:
123
/ 123 (testing)
- Files:
australian
- Source:
Statlog
/ Australian
- # of classes: 2
- # of data:
690
- # of features:
14
- Files:
breast-cancer
- Source:
UCI
/ Wisconsin Breast Cancer
- Preprocessing:
Note that column 1 of the original data contains the sample ID. Also, 16 instances with missing values are removed.
- # of classes: 2
- # of data:
683
- # of features:
10
- Files:
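A minimal sketch of the breast-cancer cleaning step described above: skip the sample ID in column 1 and discard instances that contain a missing value. The "?" marker for missing values and the toy rows are assumptions for illustration.

```python
def clean_rows(rows, missing="?"):
    """Drop the ID column and skip any row that contains a missing value."""
    cleaned = []
    for row in rows:
        fields = row[1:]          # column 1 holds the sample ID
        if missing in fields:     # assumed missing-value marker
            continue
        cleaned.append(fields)
    return cleaned

# Hypothetical rows: ID followed by attribute values.
raw = [
    ["1000025", "5", "1", "2"],
    ["1057013", "8", "?", "4"],
    ["1017122", "8", "10", "4"],
]
print(clean_rows(raw))
```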
cod-rna
- Source:
[AVU06a]
- Features:
- Divide by 10 to get the deltaG_total value computed by the Dynalign algorithm
- The length of shorter sequence
- 'A' frequencies of sequence 1
- 'U' frequencies of sequence 1
- 'C' frequencies of sequence 1
- 'A' frequencies of sequence 2
- 'U' frequencies of sequence 2
- 'C' frequencies of sequence 2
- # of classes: 2
- # of data:
59,535
/ 271,617 (validation)
/ 157,413 (unused/remaining)
- # of features:
8
- Files:
colon-cancer
- Source:
[AU99a]
- Preprocessing:
Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one.
[SKS03a]
- # of classes: 2
- # of data:
62
- # of features:
2,000
- Files:
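The two-stage normalization described for colon-cancer (also used for duke breast-cancer and leukemia) can be sketched as below: each instance is first standardized to mean zero and variance one, then each feature is standardized across instances. Using the population variance (`pstdev`) and mapping a zero-variance vector to zeros are assumptions; the page does not say which variance estimate was used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Shift and scale a vector to mean zero and (population) variance one."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s if s > 0 else 0.0 for v in values]

def normalize_matrix(rows):
    """Instance-wise standardization followed by feature-wise standardization."""
    by_instance = [standardize(row) for row in rows]
    by_feature = [standardize(list(col)) for col in zip(*by_instance)]
    return [list(row) for row in zip(*by_feature)]  # back to row-major order

# Toy expression matrix (hypothetical): 3 instances, 3 features.
data = [[1.0, 2.0, 6.0], [4.0, 1.0, 3.0], [2.0, 8.0, 5.0]]
normalized = normalize_matrix(data)
```

Note that after the second stage the features have mean zero and variance one, while the instances in general no longer do.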
covtype.binary
- Source:
UCI
/ Covertype
- Preprocessing:
Transform from multiclass into binary class.
[RC02a]
- # of classes: 2
- # of data:
581,012
- # of features:
54
- Files:
diabetes
- Source:
UCI
/ Pima Indians Diabetes
- # of classes: 2
- # of data:
768
- # of features:
8
- Files:
duke breast-cancer
- Source:
[MW01a]
- Preprocessing:
Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. The original dataset consists of 49 instances. Five were removed because the classification results from immunohistochemistry and protein immunoblotting assay conflicted. Of the remaining, two instances were rejected due to failed array hybridization. The remaining data are further split into training (38) and validation (4).
[SKS03a]
- # of classes: 2
- # of data:
44
- # of features:
7,129
- Files:
fourclass
- Source:
[TKH96a]
- Preprocessing:
Transform to two-class.
- # of classes: 2
- # of data:
862
- # of features:
2
- Files:
german.numer
- Source:
Statlog
/ German
- # of classes: 2
- # of data:
1,000
- # of features:
24
- Files:
heart
- Source:
Statlog
/ Heart
- # of classes: 2
- # of data:
270
- # of features:
13
- Files:
ijcnn1
- Source:
[DP01a]
- Preprocessing:
We use the winner's transformation
[Chang01d]
- # of classes: 2
- # of data:
49,990
/ 91,701 (testing)
- # of features:
22
- Files:
ionosphere
- Source:
UCI
/ Ionosphere
- # of classes: 2
- # of data:
351
- # of features:
34
- Files:
leukemia
- Source:
[TG99a]
- Preprocessing:
Merge training/testing. Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one.
[SKS03a]
- # of classes: 2
- # of data:
38
/ 34 (testing)
- # of features:
7,129
- Files:
liver-disorders
- Source:
UCI
/ Liver-disorders
- # of classes: 2
- # of data:
345
- # of features:
6
- Files:
mushrooms
- Source:
UCI
/ mushrooms
- Preprocessing:
Each nominal attribute is expanded into several binary attributes. The original attribute #12 has missing values and is not used.
- # of classes: 2
- # of data:
8,124
- # of features:
112
- Files:
news20.binary
- Source:
[SSK05a]
- Preprocessing:
Each instance has unit length.
- # of classes: 2
- # of data:
19,996
- # of features:
1,355,191
- Files:
rcv1.binary
- Source:
[DL04b]
- Preprocessing:
positive: CCAT, ECAT; negative: GCAT, MCAT; instances in both positive and negative classes are removed.
- # of classes: 2
- # of data:
20,242
/ 677,399 (testing)
- # of features:
47,236
- Files:
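The labeling rule for rcv1.binary stated above can be sketched as a small function. The +1/-1 label values, the function name, and the handling of documents that belong to none of the four categories are assumptions for illustration.

```python
POSITIVE = {"CCAT", "ECAT"}
NEGATIVE = {"GCAT", "MCAT"}

def binary_label(categories):
    """Return +1, -1, or None (instance removed) for a document's category codes."""
    cats = set(categories)
    is_pos = bool(cats & POSITIVE)
    is_neg = bool(cats & NEGATIVE)
    if is_pos and is_neg:
        return None   # in both classes: removed, as stated in the preprocessing note
    if not is_pos and not is_neg:
        return None   # assumption: documents in neither class are also skipped
    return 1 if is_pos else -1

print(binary_label(["CCAT", "C15"]))
print(binary_label(["GCAT"]))
print(binary_label(["ECAT", "MCAT"]))
```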
real-sim
- Source:
A. McCallum
/ Real vs. Simulated
- Preprocessing:
Done by Vikas Sindhwani for the SVMlin project.
- # of classes: 2
- # of data:
72,309
- # of features:
20,958
- Files:
splice
- Source:
Delve
/ splice
- # of classes: 2
- # of data:
1,000
/ 2,175 (testing)
- # of features:
60
- Files:
sonar
- Source:
UCI
/ Undocumented/Sonar
- # of classes: 2
- # of data:
208
- # of features:
60
- Files:
svmguide1
- Source:
[CWH03a]
- Preprocessing:
Original data: an astroparticle application from Jan Conrad of Uppsala University, Sweden.
- # of classes: 2
- # of data:
3,089
/ 4,000 (testing)
- # of features:
4
- Files:
svmguide3
- Source:
[CWH03a]
- Preprocessing:
Original data: someone from Germany working with the car industry.
- # of classes: 2
- # of data:
1,243
/ 41 (testing)
- # of features:
21
- Files:
w1a
- Source:
[JP98a]
- # of classes: 2
- # of data:
2,477
/ 47,272 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w2a
- Source:
[JP98a]
- # of classes: 2
- # of data:
3,470
/ 46,279 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w3a
- Source:
[JP98a]
- # of classes: 2
- # of data:
4,912
/ 44,837 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w4a
- Source:
[JP98a]
- # of classes: 2
- # of data:
7,366
/ 42,383 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w5a
- Source:
[JP98a]
- # of classes: 2
- # of data:
9,888
/ 39,861 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w6a
- Source:
[JP98a]
- # of classes: 2
- # of data:
17,188
/ 32,561 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w7a
- Source:
[JP98a]
- # of classes: 2
- # of data:
24,692
/ 25,057 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w8a
- Source:
[JP98a]
- # of classes: 2
- # of data:
49,749
/ 14,951 (testing)
- # of features:
300
/ 300 (testing)
- Files:
webspam
- Source:
Webb Spam Corpus
[ST06a]
- Preprocessing:
We consider the subset used in the Pascal Large Scale Learning Challenge.
According to Soeren Sonnenburg,
all positive examples were taken and the negative examples were created
by randomly traversing the Internet starting at well-known (e.g. news) web sites.
We treat n consecutive bytes as a word: trigram if n = 3 and unigram if n = 1.
We use word count as the feature value
and normalize each instance to unit length.
For unigram, the number of features is 254.
Contact us if you need scripts to obtain the
data from the original documents.
- # of classes: 2
- # of data:
350,000
- # of features:
16,609,143
- Files:
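The webspam feature construction described above (byte n-grams as words, word counts as feature values, each instance normalized to unit length) can be sketched as follows. This is not the script used to produce the data: the overlapping sliding window and the function name are assumptions.

```python
from collections import Counter
from math import sqrt

def ngram_features(data: bytes, n: int):
    """Count byte n-grams in a document and scale the count vector to unit length."""
    # Assumption: n-grams are taken with an overlapping sliding window.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}     # document shorter than n bytes: no features
    return {gram: c / norm for gram, c in counts.items()}

unigrams = ngram_features(b"abab", 1)     # n = 1: unigram features
trigrams = ngram_features(b"abcabc", 3)   # n = 3: trigram features
print(sorted(trigrams.items()))
```

A dictionary keyed by n-gram keeps the representation sparse, which matters for trigrams, where the feature space is very large.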