LIBSVM Data: Classification (Binary Class)
This page contains many classification, regression, and
multi-label data sets stored in LIBSVM format. Many
are from UCI, Statlog, StatLib, and other collections. We
thank the maintainers of these collections for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided)
is adjusted accordingly. Some training data are further separated
into "training" (tr) and "validation" (val) sets. Details can be
found in the description of each data set. To read the data in MATLAB, you can use "libsvmread" from the LIBSVM package.
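For readers who do not use MATLAB, the following is a minimal Python sketch of a reader. It relies on the LIBSVM format storing one instance per line as a label followed by sparse "index:value" pairs. The function names parse_libsvm_line and read_libsvm and the sample lines are illustrative assumptions, not part of the LIBSVM package, and multi-label files (comma-separated labels) are not handled.

```python
from typing import Dict, Iterable, List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one 'label index:value ...' line into (label, sparse features)."""
    tokens = line.split()
    label = float(tokens[0])
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


def read_libsvm(lines: Iterable[str]) -> Tuple[List[float], List[Dict[int, float]]]:
    """Read all non-empty lines; returns parallel lists of labels and sparse rows."""
    labels: List[float] = []
    rows: List[Dict[int, float]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, features = parse_libsvm_line(line)
        labels.append(label)
        rows.append(features)
    return labels, rows


# Made-up sample lines in the same shape as the binary files on this page.
sample = ["-1 3:1 11:1 14:1", "+1 5:1 7:1 14:1"]
labels, rows = read_libsvm(sample)
```

Because read_libsvm accepts any iterable of strings, an open file object can be passed to it directly.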
a1a
- Source:
UCI
/ Adult
- Preprocessing:
The original Adult data set has 14 features, among which six
are continuous and eight are categorical. In this data set,
continuous features are discretized into quantiles, and
each quantile is represented by a binary feature.
Also, a categorical feature with m categories is converted to m binary features.
Details on how each feature is converted can be found at the beginning of each file
from this page.
[JP98a]
- # of classes: 2
- # of data:
1,605
/ 30,956 (testing)
- # of features:
123
/ 123 (testing)
- Files:
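The a1a preprocessing above (one binary feature per quantile of a continuous attribute, and m binary features for a categorical attribute with m categories) can be sketched roughly as follows. The number of quantiles, the cut-point rule, and all function names here are assumptions made for illustration; the actual conversion is the one documented in the files referenced above.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_cut_points(values: Sequence[float], n_bins: int) -> List[float]:
    """Return n_bins - 1 cut points splitting the sorted values into roughly equal parts."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def quantile_one_hot(value: float, cuts: Sequence[float]) -> List[int]:
    """Binary vector with a single 1 marking the quantile bin that holds the value."""
    bin_index = bisect_right(cuts, value)
    return [1 if i == bin_index else 0 for i in range(len(cuts) + 1)]


def category_one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Binary vector of length m for a categorical feature with m categories."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical values for one continuous attribute, cut into 4 quantile bins.
ages = [20, 30, 40, 50, 60, 70, 80, 90]
cuts = quantile_cut_points(ages, n_bins=4)
encoded = quantile_one_hot(25, cuts) + category_one_hot("b", ["a", "b", "c"])
```

Concatenating the per-attribute vectors, as in the last line, gives one sparse binary row per instance.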
a2a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
2,265
/ 30,296 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a3a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
3,185
/ 29,376 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a4a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
4,781
/ 27,780 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a5a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
6,414
/ 26,147 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a6a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
11,220
/ 21,341 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a7a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
16,100
/ 16,461 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a8a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
22,696
/ 9,865 (testing)
- # of features:
123
/ 123 (testing)
- Files:
a9a
- Source:
UCI
/ Adult
- Preprocessing:
The same as a1a.
[JP98a]
- # of classes: 2
- # of data:
32,561
/ 16,281 (testing)
- # of features:
123
/ 123 (testing)
- Files:
australian
- Source:
Statlog
/ Australian
- # of classes: 2
- # of data:
690
- # of features:
14
- Files:
breast-cancer
- Source:
UCI
/ Wisconsin Breast Cancer
- Preprocessing:
Note that column 1 of the original data contains the sample ID. Also, the 16 instances with missing values are removed.
- # of classes: 2
- # of data:
683
- # of features:
10
- Files:
cod-rna
- Source:
[AVU06a]
- Features:
- Divide by 10 to get deltaG_total value computed by the Dynalign algorithm
- The length of shorter sequence
- 'A' frequencies of sequence 1
- 'U' frequencies of sequence 1
- 'C' frequencies of sequence 1
- 'A' frequencies of sequence 2
- 'U' frequencies of sequence 2
- 'C' frequencies of sequence 2
- # of classes: 2
- # of data:
59,535
/ 271,617 (validation)
/ 157,413 (unused/remaining)
- # of features:
8
- Files:
colon-cancer
- Source:
[AU99a]
- Preprocessing:
Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one.
[SKS03a]
- # of classes: 2
- # of data:
62
- # of features:
2,000
- Files:
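The two-step normalization described for colon-cancer (first each instance, then each feature, to mean zero and variance one) might look like the sketch below. It assumes the population variance and maps a constant vector to zeros; neither choice is stated in the source, and the function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List


def standardize(vector: List[float]) -> List[float]:
    """Shift to mean zero and scale to variance one (population variance, an assumption)."""
    mean = fmean(vector)
    std = pstdev(vector)
    if std == 0.0:
        return [0.0 for _ in vector]  # constant vector: nothing to scale
    return [(x - mean) / std for x in vector]


def normalize_rows_then_columns(matrix: List[List[float]]) -> List[List[float]]:
    """Instance-wise standardization followed by feature-wise standardization."""
    rows = [standardize(row) for row in matrix]                  # instance-wise
    columns = [standardize(list(col)) for col in zip(*rows)]     # feature-wise
    return [list(row) for row in zip(*columns)]                  # back to row-major


# Made-up 3 x 3 matrix (rows are instances, columns are features).
result = normalize_rows_then_columns([[1.0, 2.0, 3.0], [2.0, 4.0, 9.0], [0.0, 5.0, 1.0]])
```

After the second step the columns have mean zero and variance one, but the rows generally no longer do, so the order of the two steps matters.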
covtype.binary
- Source:
UCI
/ Covertype
- Preprocessing:
Transform from multiclass into binary class.
[RC02a]
- # of classes: 2
- # of data:
581,012
- # of features:
54
- Files:
diabetes
- Source:
UCI
/ Pima Indians Diabetes
- # of classes: 2
- # of data:
768
- # of features:
8
- Files:
duke breast-cancer
- Source:
[MW01a]
- Preprocessing:
Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. The original data set consists of 49 instances. Five are removed because the classification results from immunohistochemistry and protein immunoblotting assay conflicted. Of the remaining instances, two were rejected due to failed array hybridization. The remaining data are further split into training (38) and validation (4) sets.
[SKS03a]
- # of classes: 2
- # of data:
44
- # of features:
7,129
- Files:
epsilon
- Source:
PASCAL Challenge 2008
- Preprocessing:
The raw data set (epsilon_train) is scaled instance-wise to unit length and split into two parts: 4/5 for training and 1/5 for testing. The training part is normalized feature-wise to mean zero and variance one and then scaled instance-wise to unit length. Using the scaling factors of the training part, the testing part is processed in a similar way. These training and testing data sets are used in
[GXY11a]
- # of classes: 2
- # of data:
400,000
/ 100,000 (testing)
- # of features:
2,000
- Files:
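The epsilon preprocessing steps above can be sketched as follows. The split here simply takes the first 4/5 of the rows in order, and the population variance is used; both are assumptions, since the source does not say how the split was drawn or which variance estimate was used. All names are hypothetical.

```python
import math
from statistics import fmean, pstdev
from typing import List, Tuple

Matrix = List[List[float]]


def unit_length(row: List[float]) -> List[float]:
    """Scale one instance to Euclidean norm 1 (zero rows are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


def fit_feature_stats(train: Matrix) -> List[Tuple[float, float]]:
    """Per-feature (mean, std) computed on the training part only."""
    stats = []
    for col in zip(*train):
        std = pstdev(col)
        stats.append((fmean(col), std if std > 0 else 1.0))
    return stats


def apply_feature_stats(data: Matrix, stats: List[Tuple[float, float]]) -> Matrix:
    """Normalize each feature with the given training statistics."""
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in data]


def epsilon_style_preprocess(raw: Matrix) -> Tuple[Matrix, Matrix]:
    scaled = [unit_length(row) for row in raw]           # step 1: unit length
    cut = (len(scaled) * 4) // 5                          # step 2: 4/5 vs 1/5 split
    train, test = scaled[:cut], scaled[cut:]
    stats = fit_feature_stats(train)                      # step 3: training factors
    train = [unit_length(r) for r in apply_feature_stats(train, stats)]
    test = [unit_length(r) for r in apply_feature_stats(test, stats)]  # step 4: reuse factors
    return train, test


# Made-up data: 10 instances with 3 features each.
raw = [[float(i + 1), float((i * 7) % 5 + 1), float((i * 3) % 4 + 2)] for i in range(10)]
train, test = epsilon_style_preprocess(raw)
```

Fitting the statistics on the training part only keeps information from the testing part out of the scaling factors.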
fourclass
- Source:
[TKH96a]
- Preprocessing:
Transformed to two classes.
- # of classes: 2
- # of data:
862
- # of features:
2
- Files:
german.numer
- Source:
Statlog
/ German
- # of classes: 2
- # of data:
1,000
- # of features:
24
- Files:
gisette
- Source:
NIPS 2003 Feature Selection Challenge
[IG05a]
- Preprocessing:
The training data (gisette_train) are scaled feature-wise to [-1,1]. Then the testing data (gisette_val) are scaled using the same scaling factors as the training data. These two scaled data sets are used in
[GXY11a]
- # of classes: 2
- # of data:
6,000
/ 1,000 (testing)
- # of features:
5,000
- Files:
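Scaling the training data to [-1,1] and then reusing the same factors for the testing data, as described for gisette, can be sketched as below. The function names are hypothetical, and mapping a constant feature to 0 is an assumption made only for this sketch.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def fit_ranges(train: Matrix) -> List[Tuple[float, float]]:
    """Per-feature (min, max) taken from the training data only."""
    return [(min(col), max(col)) for col in zip(*train)]


def scale_to_range(data: Matrix, ranges: List[Tuple[float, float]],
                   lower: float = -1.0, upper: float = 1.0) -> Matrix:
    """Linearly map each feature so the training min/max land on lower/upper."""
    scaled = []
    for row in data:
        new_row = []
        for x, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(0.0)  # constant feature in training (assumed handling)
            else:
                new_row.append(lower + (upper - lower) * (x - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


# Made-up training and testing matrices with two features.
train_raw = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test_raw = [[15.0, 5.0]]
ranges = fit_ranges(train_raw)
train_scaled = scale_to_range(train_raw, ranges)
test_scaled = scale_to_range(test_raw, ranges)
```

Because the factors come from the training data, scaled testing values can fall outside [-1,1], as they do in this toy example.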
heart
- Source:
Statlog
/ Heart
- # of classes: 2
- # of data:
270
- # of features:
13
- Files:
ijcnn1
- Source:
[DP01a]
- Preprocessing:
We use the winner's transformation.
[Chang01d]
- # of classes: 2
- # of data:
49,990
/ 91,701 (testing)
- # of features:
22
- Files:
ionosphere
- Source:
UCI
/ Ionosphere
- # of classes: 2
- # of data:
351
- # of features:
34
- Files:
kdd2010 (algebra)
- Source:
KDD CUP 2010
- Preprocessing:
KDD Cup 2010 is an educational data mining competition. The data comes from Carnegie Learning and DataShop.
This is the training set of the first problem: algebra_2008_2009.
We provide a transformed version used by the winner (National Taiwan Univ). Because labels of
the competition's testing set are not available, the training
data is split into two sets for training and validation. The validation set is called the testing
set here.
To access the raw data set, please check the above "KDD CUP 2010" link.
This data set is to be used for research purposes only. Users should acknowledge that the data is from Carnegie Learning and DataShop.
[HFY10c]
- # of classes: 2
- # of data:
8,407,752
/ 510,302 (testing)
- # of features:
20,216,830
/ 20,216,830 (testing)
- Files:
kdd2010 (bridge to algebra)
- Source:
KDD CUP 2010
- Preprocessing:
KDD Cup 2010 is an educational data mining competition. The data comes from Carnegie Learning and DataShop.
This is the training set of the second problem: bridge_to_algebra_2008_2009.
We provide a transformed version used by the winner (National Taiwan Univ). Because labels of
the competition's testing set are not available, the training
data is split into two sets for training and validation. The validation set is called the testing
set here.
To access the raw data set, please check the above "KDD CUP 2010" link.
This data set is to be used for research purposes only. Users should acknowledge that the data is from Carnegie Learning and DataShop.
[HFY10c]
- # of classes: 2
- # of data:
19,264,097
/ 748,401 (testing)
- # of features:
29,890,095
/ 29,890,095 (testing)
- Files:
leukemia
- Source:
[TG99a]
- Preprocessing:
Merge training/testing. Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one.
[SKS03a]
- # of classes: 2
- # of data:
38
/ 34 (testing)
- # of features:
7,129
- Files:
liver-disorders
- Source:
UCI
/ Liver-disorders
- # of classes: 2
- # of data:
345
- # of features:
6
- Files:
mushrooms
- Source:
UCI
/ mushrooms
- Preprocessing:
Each nominal attribute is expanded into several binary attributes. The original attribute #12 has missing values and is not used.
- # of classes: 2
- # of data:
8,124
- # of features:
112
- Files:
news20.binary
- Source:
[SSK05a]
- Preprocessing:
Each instance has unit length.
- # of classes: 2
- # of data:
19,996
- # of features:
1,355,191
- Files:
rcv1.binary
- Source:
[DL04b]
- Preprocessing:
positive: CCAT, ECAT; negative: GCAT, MCAT; instances in both positive and negative classes are removed.
- # of classes: 2
- # of data:
20,242
/ 677,399 (testing)
- # of features:
47,236
- Files:
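The rcv1.binary labeling rule above (CCAT and ECAT positive, GCAT and MCAT negative, instances in both groups removed) can be expressed as a small function. The name binary_label and the choice to also skip instances that belong to neither group are assumptions of this sketch.

```python
from typing import Iterable, Optional

POSITIVE = {"CCAT", "ECAT"}
NEGATIVE = {"GCAT", "MCAT"}


def binary_label(categories: Iterable[str]) -> Optional[int]:
    """Return +1 or -1, or None when the instance should be dropped."""
    cats = set(categories)
    is_pos = bool(cats & POSITIVE)
    is_neg = bool(cats & NEGATIVE)
    if is_pos == is_neg:
        # In both groups (removed per the description), or in neither
        # (skipped here by assumption).
        return None
    return 1 if is_pos else -1


# Hypothetical category lists; only the four top-level codes affect the label.
kept = [binary_label(c) for c in (["CCAT", "C15"], ["GCAT"], ["ECAT", "MCAT"])]
```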
real-sim
- Source:
A. McCallum
/ Real vs. Simulated
- Preprocessing:
Done by Vikas Sindhwani for the SVMlin project.
- # of classes: 2
- # of data:
72,309
- # of features:
20,958
- Files:
splice
- Source:
Delve
/ splice
- # of classes: 2
- # of data:
1,000
/ 2,175 (testing)
- # of features:
60
- Files:
sonar
- Source:
UCI
/ Undocumented/Sonar
- # of classes: 2
- # of data:
208
- # of features:
60
- Files:
svmguide1
- Source:
[CWH03a]
- Preprocessing:
Original data: an astroparticle application from Jan Conrad of Uppsala University, Sweden.
- # of classes: 2
- # of data:
3,089
/ 4,000 (testing)
- # of features:
4
- Files:
svmguide3
- Source:
[CWH03a]
- Preprocessing:
Original data: someone from Germany working with the car industry.
- # of classes: 2
- # of data:
1,243
/ 41 (testing)
- # of features:
21
- Files:
url
- Source:
[JM09a]
- Preprocessing:
The file "url_original.tar.bz2" contains a directory with 121 days of data, in which the file "FeatureTypes" gives the indices of real-valued features (other features are 0/1). The file "url_combined.bz2" combines the data of all 121 days into one file. See more details in this page.
- # of classes: 2
- # of data:
2,396,130
- # of features:
3,231,961
- Files:
w1a
- Source:
[JP98a]
- # of classes: 2
- # of data:
2,477
/ 47,272 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w2a
- Source:
[JP98a]
- # of classes: 2
- # of data:
3,470
/ 46,279 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w3a
- Source:
[JP98a]
- # of classes: 2
- # of data:
4,912
/ 44,837 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w4a
- Source:
[JP98a]
- # of classes: 2
- # of data:
7,366
/ 42,383 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w5a
- Source:
[JP98a]
- # of classes: 2
- # of data:
9,888
/ 39,861 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w6a
- Source:
[JP98a]
- # of classes: 2
- # of data:
17,188
/ 32,561 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w7a
- Source:
[JP98a]
- # of classes: 2
- # of data:
24,692
/ 25,057 (testing)
- # of features:
300
/ 300 (testing)
- Files:
w8a
- Source:
[JP98a]
- # of classes: 2
- # of data:
49,749
/ 14,951 (testing)
- # of features:
300
/ 300 (testing)
- Files:
webspam
- Source:
Webb Spam Corpus
[ST06a]
- Preprocessing:
We consider the subset used in the Pascal Large Scale Learning Challenge.
According to Soeren Sonnenburg,
all positive examples were taken and the negative examples were created
by randomly traversing the Internet starting at well known (e.g. news) web-sites.
We treat n consecutive bytes as a word: trigram if n = 3 and unigram if n = 1.
We use word count as the feature value
and normalize each instance to unit length.
For unigram, the number of features is 254.
Contact us if you need scripts to obtain the
data from the original documents.
- # of classes: 2
- # of data:
350,000
- # of features:
16,609,143
- Files:
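The webspam feature construction described above (n consecutive bytes as a word, word counts as feature values, each instance normalized to unit length) can be sketched as follows. The function name and the use of the raw byte string as the feature key are assumptions; how n-grams are mapped to feature indices in the released files is not described here.

```python
import math
from collections import Counter
from typing import Dict


def byte_ngram_features(document: bytes, n: int) -> Dict[bytes, float]:
    """Count every run of n consecutive bytes, then scale the counts to unit length."""
    counts = Counter(document[i:i + n] for i in range(len(document) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # document shorter than n bytes
    return {gram: c / norm for gram, c in counts.items()}


# Made-up document; n = 1 gives unigrams and n = 3 gives trigrams.
unigrams = byte_ngram_features(b"abab", 1)
trigrams = byte_ngram_features(b"abab", 3)
```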