LIBSVM Data: Classification (Binary Class)

This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets are from UCI, Statlog, StatLib and other collections. We thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) are adjusted accordingly. Some training data are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set. To read data in MATLAB, you can use "libsvmread" from the LIBSVM package.
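
For readers not using MATLAB, the following is a minimal Python sketch of a reader. It assumes the usual LIBSVM line layout, a label followed by sparse index:value pairs, and handles single-label lines only. The function names parse_libsvm_line and read_libsvm are illustrative and are not part of the LIBSVM package.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def read_libsvm(path):
    """Read a whole uncompressed file; blank lines are skipped."""
    with open(path) as handle:
        return [parse_libsvm_line(line) for line in handle if line.strip()]


# A made-up line in the same layout as the files on this page.
label, features = parse_libsvm_line("-1 3:1 11:1 14:1 19:1")
```

Features absent from a line are implicitly zero, so a dictionary keeps the sparse structure. Files ending in .bz2 or .xz need to be decompressed first.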

a1a

Source: UCI / Adult
Preprocessing: The original Adult data set has 14 features, of which six are continuous and eight are categorical. In this data set, continuous features are discretized into quantiles, and each quantile is represented by a binary feature. Also, a categorical feature with m categories is converted to m binary features. Details on how each feature is converted can be found at the beginning of each file from this page. [JP98a]
# of classes: 2
# of data: 1,605 / 30,956 (testing)
# of features: 123 / 123 (testing)
Files:
- a1a
- a1a.t (testing)
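
The conversion described above can be sketched in Python as follows. This is only an illustration: the number of bins, the cut points, and the sample values and category names below are hypothetical. The actual conversion of each feature is documented at the beginning of the data files.

```python
from bisect import bisect_right


def quantile_boundaries(values, n_bins):
    """Return n_bins - 1 cut points taken at evenly spaced ranks."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def one_hot(position, size):
    """Binary vector of length `size` with a single 1 at `position`."""
    return [1 if i == position else 0 for i in range(size)]


def encode_continuous(value, boundaries):
    """One binary feature per quantile bin."""
    return one_hot(bisect_right(boundaries, value), len(boundaries) + 1)


def encode_categorical(value, categories):
    """A feature with m categories becomes m binary features."""
    return one_hot(categories.index(value), len(categories))


# Hypothetical values, not taken from the Adult data.
ages = [19, 23, 28, 31, 37, 42, 48, 55, 61, 70]
cuts = quantile_boundaries(ages, 5)
age_bits = encode_continuous(30, cuts)
work_bits = encode_categorical("private", ["private", "self-employed", "government"])
```

Each encoded feature contributes exactly one nonzero entry, which is why the resulting vectors are sparse and binary.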

a2a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 2,265 / 30,296 (testing)
# of features: 123 / 123 (testing)
Files:
- a2a
- a2a.t (testing)

a3a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 3,185 / 29,376 (testing)
# of features: 123 / 123 (testing)
Files:
- a3a
- a3a.t (testing)

a4a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 4,781 / 27,780 (testing)
# of features: 123 / 123 (testing)
Files:
- a4a
- a4a.t (testing)

a5a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 6,414 / 26,147 (testing)
# of features: 123 / 123 (testing)
Files:
- a5a
- a5a.t (testing)

a6a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 11,220 / 21,341 (testing)
# of features: 123 / 123 (testing)
Files:
- a6a
- a6a.t (testing)

a7a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 16,100 / 16,461 (testing)
# of features: 123 / 123 (testing)
Files:
- a7a
- a7a.t (testing)

a8a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 22,696 / 9,865 (testing)
# of features: 123 / 123 (testing)
Files:
- a8a
- a8a.t (testing)

a9a

Source: UCI / Adult
Preprocessing: The same as a1a. [JP98a]
# of classes: 2
# of data: 32,561 / 16,281 (testing)
# of features: 123 / 123 (testing)
Files:
- a9a
- a9a.t (testing)

australian

Source: Statlog / Australian
# of classes: 2
# of data: 690
# of features: 14
Files:
- australian
- australian_scale (scaled to [-1,1])

avazu

Source: Avazu's Click-through Prediction
Preprocessing: This data is used in a competition on click-through rate prediction jointly hosted by Avazu and Kaggle in 2014. The participants were asked to learn a model from the first 10 days of advertising log, and predict the click probability for the impressions on the 11th day. The data sets here are generated by applying our winning solution without some complicated components. To reproduce this data, you can execute our code and see the results in the directory "base." For better test scores, we divide the data into two disjoint groups "app" and "site," and conduct training and prediction tasks on the two groups independently. Specifically, each instance has either "site_id=85f751fd" or "app_id=ecad2386," and these two feature values never co-occur. Thus we can split the data set according to them. The organizers do not disclose the test labels, so the labels in the test sets are not meaningful. To obtain a test score, please use the code provided below to generate and submit a file to the competition site. Because the data are time dependent, cross validation is not suitable for parameter selection. We provide a training-validation split (e.g., "avazu-app.tr" and "avazu-app.val") by using the last 4,218,938 training instances for validation. [YJ16a]
# of classes: 2
# of data: 40,428,967 / 4,577,464 (testing) / 14,596,137 (avazu-app) / 1,719,304 (avazu-app.t) / 12,642,186 (avazu-app.tr) / 1,953,951 (avazu-app.val) / 25,832,830 (avazu-site) / 2,858,160 (avazu-site.t) / 23,567,843 (avazu-site.tr) / 2,264,987 (avazu-site.val)
# of features: 1,000,000
Files:
- avazu-app.bz2 (app)
- avazu-app.t.bz2 (app's testing)
- avazu-app.tr.bz2 (app's tr)
- avazu-app.val.bz2 (app's val)
- avazu-site.bz2 (site)
- avazu-site.t.bz2 (site's testing)
- avazu-site.tr.bz2 (site's tr)
- avazu-site.val.bz2 (site's val)
- avazu-submit.zip (code to generate a submission file)
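
The grouping and the time-ordered validation split described above can be sketched in Python as follows. The two identifier values are quoted from the description. Which of them marks the "app" group is an assumption of this sketch, and the rows and the "id" field are made up for illustration.

```python
PLACEHOLDER_SITE_ID = "85f751fd"  # value quoted in the description
PLACEHOLDER_APP_ID = "ecad2386"   # value quoted in the description


def group_of(row):
    """Return 'app' or 'site' for one log row given as a dict.

    Assumption: a row carrying the placeholder site_id belongs to the
    "app" group, and a row carrying the placeholder app_id to "site".
    """
    has_site_placeholder = row["site_id"] == PLACEHOLDER_SITE_ID
    has_app_placeholder = row["app_id"] == PLACEHOLDER_APP_ID
    if has_site_placeholder == has_app_placeholder:
        raise ValueError("expected exactly one placeholder value per row")
    return "app" if has_site_placeholder else "site"


def split_groups(rows):
    """Partition rows into the two disjoint groups, keeping their order."""
    groups = {"app": [], "site": []}
    for row in rows:
        groups[group_of(row)].append(row)
    return groups


def holdout_tail(rows, n_validation):
    """Keep time order: the last n_validation rows form the validation set."""
    cut = len(rows) - n_validation
    return rows[:cut], rows[cut:]


# Made-up rows in chronological order.
rows = [
    {"id": 1, "site_id": "85f751fd", "app_id": "a0000001"},
    {"id": 2, "site_id": "s0000001", "app_id": "ecad2386"},
    {"id": 3, "site_id": "85f751fd", "app_id": "a0000002"},
    {"id": 4, "site_id": "s0000002", "app_id": "ecad2386"},
]
train_rows, val_rows = holdout_tail(rows, n_validation=1)
train_groups = split_groups(train_rows)
val_groups = split_groups(val_rows)
```

Holding out the tail rather than sampling at random keeps every validation instance later in time than the training instances.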

breast-cancer

Source: UCI / Wisconsin Breast Cancer
Preprocessing: Note that column 1 of the original data contains the sample ID. Also, 16 instances with missing values are removed.
# of classes: 2
# of data: 683
# of features: 10
Files:
- breast-cancer
- breast-cancer_scale (scaled to [-1,1])

cod-rna

Source: [AVU06a]
Features:
1. Divide by 10 to get deltaG_total value computed by the Dynalign algorithm
2. The length of shorter sequence
3. 'A' frequencies of sequence 1
4. 'U' frequencies of sequence 1
5. 'C' frequencies of sequence 1
6. 'A' frequencies of sequence 2
7. 'U' frequencies of sequence 2
8. 'C' frequencies of sequence 2
# of classes: 2
# of data: 59,535 / 271,617 (validation) / 157,413 (unused/remaining)
# of features: 8
Files:
- cod-rna (training)
- cod-rna.t (validation)
- cod-rna.r (unused/remaining)

colon-cancer

Source: [AU99a]
Preprocessing: Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. [SKS03a]
# of classes: 2
# of data: 62
# of features: 2,000
Files:
- colon-cancer.bz2
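
The two normalization steps described above can be sketched in Python as follows. This is an illustration only: the use of the population variance and the handling of constant vectors are assumptions, and the small matrix is made up.

```python
from statistics import fmean, pstdev


def standardize(vector):
    """Shift and scale one vector to mean zero and variance one."""
    mean = fmean(vector)
    std = pstdev(vector)  # population standard deviation (an assumption)
    if std == 0.0:
        return [0.0 for _ in vector]  # constant vector: nothing to scale
    return [(x - mean) / std for x in vector]


def normalize(matrix):
    """Instance-wise standardization followed by feature-wise standardization."""
    rows = [standardize(row) for row in matrix]
    columns = [standardize(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Made-up 3 x 4 matrix: one row per instance, one column per feature.
matrix = [
    [1.0, 2.0, 4.0, 9.0],
    [3.0, 1.0, 0.0, 5.0],
    [2.0, 8.0, 6.0, 1.0],
]
normalized = normalize(matrix)
```

After the second step every feature has mean zero and variance one, while the instances are in general no longer exactly standardized.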

covtype.binary

Source: UCI / Covertype
Preprocessing: Transform from multiclass into binary class. [RC02a]
# of classes: 2
# of data: 581,012
# of features: 54
Files:
- covtype.libsvm.binary.bz2
- covtype.libsvm.binary.scale.bz2 (scaled to [0,1])

criteo

Source: Criteo's Display Advertising Challenge
Preprocessing: This data is used in a competition on click-through rate prediction jointly hosted by Criteo and Kaggle in 2014. The script for transforming data to LIBFFM and LIBSVM formats is provided in the link down below. The features are generated based on a simplified version of the winning solution. Please download the scripts here and check the README file for details. We also provide code to generate submission files for evaluation at the competition site. [YJ16a]
# of classes: 2
# of data: 45,840,617 / 6,042,135 (testing)
# of features: 1,000,000
Files:
- criteo-research-kaggle-display-advertising-challenge-dataset.tar.gz (link to raw data at criteo)
- criteo.kaggle2014.svm.tar.xz (LIBSVM format)
- criteo_tb_trans.zip (code to generate data in LIBSVM format)

criteo_tb

Source: Criteo's Terabyte Click Logs
Preprocessing: The original data are click logs of 24 days. We use the first 23 days as the training set and the last day as the testing set. The features are generated based on a simplified version of the winning solution of a smaller-scale competition on click-through rate prediction jointly hosted by Criteo and Kaggle in 2014. The script for transforming data to LIBFFM and LIBSVM formats is provided in the link down below. It is the same as the one for the criteo data set. If you don't have enough RAM to run LIBLINEAR, you can use the following code at LIBSVM tools and see our experimental log here. The code used is a disk-level linear classifier.
# of classes: 2
# of data: 4,195,197,692 / 178,274,637 (testing)
# of features: 1,000,000
Files:
- criteo_tb.svm.tar.xz (LIBSVM format)
- criteo_tb_trans.zip (code to generate data)
- logloss.py (code to calculate log loss)

diabetes

Source: UCI / Pima Indians Diabetes
# of classes: 2
# of data: 768
# of features: 8
Files:
- diabetes
- diabetes_scale (scaled to [-1,1])

duke breast-cancer

Source: [MW01a]
Preprocessing: Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. The original dataset consists of 49 instances. Five are removed since the classification results using immunohistochemistry and protein immunoblotting assay conflicted. Of the remaining, two instances were rejected due to failed array hybridization. The remaining data are further split into training (38) and validation (4). [SKS03a]
# of classes: 2
# of data: 44
# of features: 7,129
Files:
- duke.bz2
- duke.tr.bz2 (tr)
- duke.val.bz2 (val)

epsilon

Source: PASCAL Challenge 2008
Preprocessing: The raw data set (epsilon_train) is instance-wise scaled to unit length and split into two parts: 4/5 for training and 1/5 for testing. The training part is feature-wise normalized to mean zero and variance one and then instance-wise scaled to unit length. Using the scaling factors of the training part, the testing part is processed in a similar way. These training and testing data sets are used in [GXY11a].
# of classes: 2
# of data: 400,000 / 100,000 (testing)
# of features: 2,000
Files:
- epsilon_normalized.bz2
- epsilon_normalized.t.bz2 (testing)
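
The preprocessing steps described above can be sketched in Python as follows. This is an illustration under assumptions: the split here simply takes the first 4/5 of the rows, the population variance is used, and the tiny data set is made up.

```python
import math
from statistics import fmean, pstdev


def unit_length(row):
    """Scale one instance to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


def fit_feature_stats(rows):
    """Per-feature (mean, std) computed on the training part only."""
    stats = []
    for column in zip(*rows):
        std = pstdev(column)
        stats.append((fmean(column), std if std > 0 else 1.0))
    return stats


def apply_pipeline(rows, stats):
    """Standardize each feature with the given stats, then rescale each row."""
    standardized = [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]
    return [unit_length(row) for row in standardized]


# Made-up raw data: five instances with two features.
raw = [unit_length(r) for r in [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [5.0, 12.0], [1.0, 1.0]]]
cut = len(raw) * 4 // 5  # first 4/5 for training (the split rule is an assumption)
train_raw, test_raw = raw[:cut], raw[cut:]
stats = fit_feature_stats(train_raw)
train = apply_pipeline(train_raw, stats)
test = apply_pipeline(test_raw, stats)
```

The testing part never contributes to the means and standard deviations. It only reuses the factors fitted on the training part.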

fourclass

Source: [TKH96a]
Preprocessing: Transform to two-class.
# of classes: 2
# of data: 862
# of features: 2
Files:
- fourclass
- fourclass_scale (scaled to [-1,1])

german.numer

Source: Statlog / German
# of classes: 2
# of data: 1,000
# of features: 24
Files:
- german.numer
- german.numer_scale (scaled to [-1,1])

gisette

Source: NIPS 2003 Feature Selection Challenge [IG05a]
Preprocessing: The data set is also available at UCI. Because the labels of the testing set are not available, here we use the validation set (gisette_valid.data and gisette_valid.labels) as the testing set. The training data (gisette_train) are feature-wise scaled to [-1,1]. Then the testing data (gisette_valid) are scaled based on the same scaling factors as the training data. These two scaled data sets are used in [GXY11a].
# of classes: 2
# of data: 6,000 / 1,000 (testing)
# of features: 5,000
Files:
- gisette_scale.bz2
- gisette_scale.t.bz2 (testing)
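
Scaling the testing data with the training factors, as described above, can be sketched in Python as follows. The function names, the treatment of constant features, and the small example values are assumptions made for illustration.

```python
def fit_ranges(rows):
    """Per-feature (min, max) computed on the training data."""
    return [(min(column), max(column)) for column in zip(*rows)]


def scale(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature from its training range to [lower, upper]."""
    scaled = []
    for row in rows:
        new_row = []
        for x, (lo, hi) in zip(row, ranges):
            if hi == lo:
                new_row.append(0.0)  # constant feature: a choice made for this sketch
            else:
                new_row.append(lower + (upper - lower) * (x - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


# Made-up values: three training instances and one testing instance.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]
ranges = fit_ranges(train)
train_scaled = scale(train, ranges)
test_scaled = scale(test, ranges)
```

Because the ranges come from the training data only, a testing value outside the training range maps outside [-1,1], as the second testing feature does here.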

heart

Source: Statlog / Heart
# of classes: 2
# of data: 270
# of features: 13
Files:
- heart
- heart_scale (scaled to [-1,1])

HIGGS

Source: UCI / HIGGS
Preprocessing: In the original paper, the last 500,000 instances are used for testing, while the remaining instances are used for training. See [PB14a].
# of classes: 2
# of data: 11,000,000
# of features: 28
Files:
- HIGGS.xz

Hyperpartisan News Detection

Source: SemEval-2019 Task 4: Hyperpartisan News Detection
Preprocessing: The original dataset can be downloaded from Zenodo. The texts and labels are stored in articles-training-byarticle-20181122.zip and ground-truth-training-byarticle-20181122.zip respectively. We followed the preprocessing procedures from Longformer. We provide both a full-size dataset and training/validation/test subsets that we split according to hp-splits.json.
# of classes: 2
# of data: 516 / 64 (validation) / 65 (testing)
# of features:
Files:

ijcnn1

Source: [DP01a]
Preprocessing: We use the winner's transformation. [Chang01d]
# of classes: 2
# of data: 49,990 / 91,701 (testing)
# of features: 22
Files:
- ijcnn1.bz2
- ijcnn1.t.bz2 (testing)
- ijcnn1.tr.bz2 (tr)
- ijcnn1.val.bz2 (val)

imdb-sentiment

Source: Learning Word Vectors for Sentiment Analysis
Preprocessing: The original dataset can be downloaded from Large Movie Review Dataset. It has already been split into training and test datasets. We replaced any sequence of whitespace characters \s (a shorthand for [ \t\n\r\f\v]) with a space.
# of classes: 2
# of data: 25,000 / 25,000 (testing)
# of features:
Files:
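
The whitespace replacement described above can be sketched in Python as follows. The function name and the sample text are illustrative.

```python
import re

# re.ASCII restricts \s to [ \t\n\r\f\v], matching the description.
WHITESPACE_RUN = re.compile(r"\s+", flags=re.ASCII)


def collapse_whitespace(text):
    """Replace every run of whitespace characters with one space."""
    return WHITESPACE_RUN.sub(" ", text)


cleaned = collapse_whitespace("A great\tmovie.\r\n\r\nWould  watch again.")
```

Without the re.ASCII flag, \s in a Python str pattern would also match other Unicode whitespace characters.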

ionosphere

Source: UCI / Ionosphere
# of classes: 2
# of data: 351
# of features: 34
Files:
- ionosphere_scale (scaled to [-1,1])

kdd2010 (algebra)

Source: KDD CUP 2010
Preprocessing: KDD Cup 2010 is an educational data mining competition. The data comes from Carnegie Learning and DataShop. This is the training set of the first problem: algebra_2008_2009. We provide a transformed version used by the winner (National Taiwan Univ). Because labels of the competition's testing set are not available, the training data is split into two sets for training and validation. The validation set is called the testing set here. To access the raw data set, please check the above "KDD CUP 2010" link. This data set is only to be used for research purposes. Users should acknowledge that the data is from Carnegie Learning and DataShop. [HFY10c]
# of classes: 2
# of data: 8,407,752 / 510,302 (testing)
# of features: 20,216,830 / 20,216,830 (testing)
Files:
- kdda.bz2
- kdda.t.bz2 (testing)

kdd2010 (bridge to algebra)

Source: KDD CUP 2010
Preprocessing: KDD Cup 2010 is an educational data mining competition. The data comes from Carnegie Learning and DataShop. This is the training set of the second problem: bridge_to_algebra_2008_2009. We provide a transformed version used by the winner (National Taiwan Univ). Because labels of the competition's testing set are not available, the training data is split into two sets for training and validation. The validation set is called the testing set here. To access the raw data set, please check the above "KDD CUP 2010" link. This data set is only to be used for research purposes. Users should acknowledge that the data is from Carnegie Learning and DataShop. [HFY10c]
# of classes: 2
# of data: 19,264,097 / 748,401 (testing)
# of features: 29,890,095 / 29,890,095 (testing)
Files:
- kddb.bz2
- kddb.t.bz2 (testing)

kdd2010 raw version (bridge to algebra)

Source: KDD CUP 2010
Preprocessing: This data set comes from the same source as "kdd2010 (bridge to algebra)." All settings are the same except that we give raw data without applying the winner's feature engineering procedure. We provide data in the original format and in the LIBSVM format. For the LIBSVM-format data, we treat each feature as a categorical type and use binary encoding to generate a sparse feature vector. This set was used in experiments in [YJ16a].
# of classes: 2
# of data: 19,264,097 / 748,401 (testing)
# of features: 1,163,024
Files:
- kddb-raw.bz2 (raw)
- kddb-raw.t.bz2 (raw: testing)
- kddb-raw-libsvm.bz2 (libsvm)
- kddb-raw-libsvm.t.bz2 (libsvm: testing)

kdd2012

Source: KDD CUP 2012
Preprocessing: We generate this data set from the official "training.txt" file of the second track in KDD CUP 2012. In the given file, each line gives a feature vector and the number of clicked/non-clicked impressions under these feature values. We set the label to be positive if the number of clicks is non-zero and negative otherwise. Every feature is treated as categorical and converted to binary features according to the number of possible categories. In addition, each feature vector is normalized to have unit length. Because the official evaluation system no longer works, we also provide an 80-20 split used in our paper for calculating the test score. [YJ16a]
# of classes: 2
# of data: 149,639,105 / 119,705,032 (training) / 29,934,073 (validation)
# of features: 54,686,452
Files:
- kdd12.xz
- kdd12.tr.xz (tr)
- kdd12.val.xz (val)
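
The labeling and normalization rules described above can be sketched in Python as follows. This is an illustration only: the label values +1/-1, the feature indices, and the output formatting are assumptions, and every instance is assumed to have at least one active feature.

```python
import math


def label_from_counts(n_clicks):
    """Positive if any impression was clicked, negative otherwise.

    The numeric values +1 and -1 are an assumption of this sketch.
    """
    return 1 if n_clicks > 0 else -1


def encode_unit_length(active_indices):
    """Binary features at the given indices, scaled to unit Euclidean length."""
    scale = 1.0 / math.sqrt(len(active_indices))  # assumes a non-empty set
    return {index: scale for index in sorted(active_indices)}


def to_libsvm_line(n_clicks, active_indices):
    """Format one instance as a LIBSVM-style line."""
    features = encode_unit_length(active_indices)
    pairs = " ".join(f"{i}:{v:.6g}" for i, v in features.items())
    return f"{label_from_counts(n_clicks)} {pairs}"


# Hypothetical instance with four active binary features and two clicks.
line = to_libsvm_line(n_clicks=2, active_indices={7, 42, 1001, 5})
```

With k active binary features, each nonzero value becomes 1/sqrt(k), so the squared values sum to one.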

leukemia

Source: [TG99a]
Preprocessing: Merge training/testing. Instance-wise normalization to mean zero and variance one. Then feature-wise normalization to mean zero and variance one. [SKS03a]
# of classes: 2
# of data: 38 / 34 (testing)
# of features: 7,129
Files:
- leu.bz2
- leu.t.bz2 (testing)

liver-disorders

Source: UCI / Liver-disorders
Preprocessing: The original data set has 7 variables per instance. The last variable is a selector indicating whether an instance goes to the training or the testing data set. Previously, the data set was wrongly interpreted by using the last variable as the label. Since May 21, 2016, we have followed the recommendation made by James McDermott and the data set donor Richard S. Forsyth to address the issue. Now the label of an instance is determined by the 6th variable: if the 6th variable is larger than 3, then the label is 1; otherwise it is 0. [JM16a]
# of classes: 2
# of data: 145 / 200 (testing)
# of features: 5
Files:
- liver-disorders
- liver-disorders_scale (scaled to [-1,1])
- liver-disorders.t (testing)
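
The labeling rule described above can be sketched in Python as follows. The records are made up, and the mapping of selector value 1 to training and any other value to testing is an assumption of this sketch.

```python
def convert_row(row):
    """Turn one 7-variable record into (split, label, features)."""
    if len(row) != 7:
        raise ValueError("expected 7 variables per instance")
    features = [float(x) for x in row[:5]]   # variables 1 to 5
    label = 1 if float(row[5]) > 3 else 0    # rule on the 6th variable
    # Assumption: selector 1 marks training rows, anything else testing rows.
    split = "train" if int(row[6]) == 1 else "test"
    return split, label, features


# Made-up records, each with 7 variables.
first = convert_row(["85", "92", "45", "27", "31", "0.0", "1"])
second = convert_row(["90", "70", "20", "15", "60", "6.0", "2"])
```

Only the first five variables remain as features, which matches the feature count listed for this data set.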

madelon

Source: NIPS 2003 Feature Selection Challenge [IG05a]
Preprocessing: The data set is also available at UCI. Because the labels of the testing set are not available, here we use the validation set (madelon_valid.data and madelon_valid.labels) as the testing set.
# of classes: 2
# of data: 2,000 / 600 (testing)
# of features: 500
Files:
- madelon
- madelon.t (testing)

mushrooms

Source: UCI / mushrooms
Preprocessing: Each nominal attribute is expanded into several binary attributes. The original attribute #12 has missing values and is not used.
# of classes: 2
# of data: 8,124
# of features: 112
Files:
- mushrooms

news20.binary

Source: [SSK05a]
Preprocessing: Each instance has unit length.
# of classes: 2
# of data: 19,996
# of features: 1,355,191
Files:
- news20.binary.bz2

phishing

Source: UCI / Phishing Websites
Preprocessing: All features are categorical. We use binary encoding to generate feature vectors. Each feature vector is normalized to maintain unit-length. [YJ16a]
# of classes: 2
# of data: 11,055
# of features: 68
Files:
- phishing

rcv1.binary

Source: [DL04b]
Preprocessing: positive: CCAT, ECAT; negative: GCAT, MCAT; instances in both positive and negative classes are removed.
# of classes: 2
# of data: 20,242 / 677,399 (testing)
# of features: 47,236
Files:
- rcv1_train.binary.bz2
- rcv1_test.binary.bz2 (testing)
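
The class assignment described above can be sketched in Python as follows. The label values +1/-1 and the handling of documents with none of the four categories are assumptions of this sketch.

```python
POSITIVE = {"CCAT", "ECAT"}
NEGATIVE = {"GCAT", "MCAT"}


def binary_label(categories):
    """Return +1, -1, or None when the document is dropped."""
    cats = set(categories)
    is_positive = bool(cats & POSITIVE)
    is_negative = bool(cats & NEGATIVE)
    if is_positive and is_negative:
        return None  # in both classes: removed, as described
    if is_positive:
        return 1
    if is_negative:
        return -1
    return None  # none of the four categories (case not covered by the description)


labels = [binary_label(c) for c in (["CCAT"], ["MCAT", "GCAT"], ["ECAT", "GCAT"])]
```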

real-sim

Source: A. McCallum / Real vs. Simulated
Preprocessing: Done by Vikas Sindhwani for the SVMlin project.
# of classes: 2
# of data: 72,309
# of features: 20,958
Files:
- real-sim.bz2

skin_nonskin

Source: UCI / Skin Segmentation Data Set
# of classes: 2
# of data: 245,057
# of features: 3
Files:
- skin_nonskin

splice

Source: Delve / splice
# of classes: 2
# of data: 1,000 / 2,175 (testing)
# of features: 60
Files:
- splice
- splice_scale (scaled to [-1,1])
- splice.t (testing)

splice-site

Source: [SS10a,AA12a] / splice-site
Preprocessing: In Sonnenburg and Franc (2010) for splice site prediction, they internally map data to a high dimensional space and train a linear classifier. Agarwal et al. (2014) use the script to explicitly store all feature values. We apply the same script to the original data for generating training/test files:
```
> splice_explicit_features data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta_down data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta_up data/H_sapiens_acc_all_examples_plain_50000000.label  
```
and
```
> splice_explicit_features data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta_down data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta_up data/H_sapiens_acc_all_examples_plain_5e7_test.label  
```
This set is highly skewed, so auPRC (area under the precision-recall curve) is the suitable criterion. Using the MATLAB Statistics Toolbox, you can obtain auPRC by
```
[Xpr,Ypr,Tpr,AUCpr] = perfcurve(labels, predictions, 1, 'xCrit', 'reca', 'yCrit', 'prec'); AUCpr
```
where labels are the true labels and predictions are your predicted decision values. You can use LIBLINEAR with option -s 3 (i.e., L2-regularized L1-loss SVM) to get an auPRC of 0.5773, similar to the 0.5775 reported in Table 2 of Sonnenburg and Franc (2010). If you don't have enough RAM to run LIBLINEAR, you can use the following code at LIBSVM tools and see our experimental log here. The code used is a disk-level linear classifier. [HFY11a]
# of classes: 2
# of data: 50,000,000 / 4,627,840 (testing)
# of features: 11,725,480
Files:
- splice_site.xz (md5sum=df3bd1b65b9df5776907721dff4fdb4e)
- splice_site.t.xz (testing)
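
For readers without MATLAB, the area under the precision-recall curve can be approximated in plain Python as sketched below. This is not a port of perfcurve: the trapezoidal rule, the treatment of tied decision values, and the extension of the first precision value back to recall 0 are choices made for this sketch, so results may differ slightly from the MATLAB output.

```python
def au_prc(labels, scores, positive=1):
    """Area under the precision-recall curve by trapezoidal integration."""
    n_pos = sum(1 for y in labels if y == positive)
    if n_pos == 0:
        raise ValueError("no positive instances")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = []  # (recall, precision) at each distinct threshold
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all instances tied at this decision value together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == positive:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((tp / n_pos, tp / (tp + fp)))
    # Extend the first precision value back to recall 0 (a convention of this sketch).
    area = points[0][0] * points[0][1]
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


perfect = au_prc([1, 1, -1, -1], [4.0, 3.0, 2.0, 1.0])
mixed = au_prc([1, -1, 1, -1], [4.0, 3.0, 2.0, 1.0])
```

Here labels are the true labels and scores are decision values, where a larger value means "more likely positive".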

sonar

Source: UCI / Undocumented/Sonar
# of classes: 2
# of data: 208
# of features: 60
Files:
- sonar_scale (scaled to [-1,1])

SUSY

Source: UCI / SUSY
Preprocessing: In the original paper, the last 500,000 instances are used for testing, while the remaining instances are used for training. See [PB14a].
# of classes: 2
# of data: 5,000,000
# of features: 18
Files:
- SUSY.xz

svmguide1

Source: [CWH03a]
Preprocessing: Original data: an astroparticle application from Jan Conrad of Uppsala University, Sweden.
# of classes: 2
# of data: 3,089 / 4,000 (testing)
# of features: 4
Files:
- svmguide1
- svmguide1.t (testing)

svmguide3

Source: [CWH03a]
Preprocessing: Original data: someone from Germany working with the car industry.
# of classes: 2
# of data: 1,243 / 41 (testing)
# of features: 21
Files:
- svmguide3
- svmguide3.t (testing)

url

Source: [JM09a]
Preprocessing: The file "url_original.tar.bz2" contains a directory 121 days, in which the file "FeatureTypes" gives indices of real-valued features (other features are 0/1). The file "url_combined.bz2" combines all 121-day data into one file. See more details in this page.
# of classes: 2
# of data: 2,396,130
# of features: 3,231,961
Files:
- url_combined.bz2
- url_combined_normalized.bz2 (scaled to unit length for each instance)
- url_original.tar.bz2

w1a

Source: [JP98a]
# of classes: 2
# of data: 2,477 / 47,272 (testing)
# of features: 300 / 300 (testing)
Files:
- w1a
- w1a.t (testing)

w2a

Source: [JP98a]
# of classes: 2
# of data: 3,470 / 46,279 (testing)
# of features: 300 / 300 (testing)
Files:
- w2a
- w2a.t (testing)

w3a

Source: [JP98a]
# of classes: 2
# of data: 4,912 / 44,837 (testing)
# of features: 300 / 300 (testing)
Files:
- w3a
- w3a.t (testing)

w4a

Source: [JP98a]
# of classes: 2
# of data: 7,366 / 42,383 (testing)
# of features: 300 / 300 (testing)
Files:
- w4a
- w4a.t (testing)

w5a

Source: [JP98a]
# of classes: 2
# of data: 9,888 / 39,861 (testing)
# of features: 300 / 300 (testing)
Files:
- w5a
- w5a.t (testing)

w6a

Source: [JP98a]
# of classes: 2
# of data: 17,188 / 32,561 (testing)
# of features: 300 / 300 (testing)
Files:
- w6a
- w6a.t (testing)

w7a

Source: [JP98a]
# of classes: 2
# of data: 24,692 / 25,057 (testing)
# of features: 300 / 300 (testing)
Files:
- w7a
- w7a.t (testing)

w8a

Source: [JP98a]
# of classes: 2
# of data: 49,749 / 14,951 (testing)
# of features: 300 / 300 (testing)
Files:
- w8a
- w8a.t (testing)

webspam

Source: Webb Spam Corpus [ST06a]
Preprocessing: We consider the subset used in the Pascal Large Scale Learning Challenge. According to Soeren Sonnenburg, all positive examples were taken and the negative examples were created by randomly traversing the Internet starting at well-known (e.g., news) web sites. We treat n consecutive bytes as a word: trigram if n = 3 and unigram if n = 1. We use word count as the feature value and normalize each instance to unit length. For unigram, the number of features is 254. Contact us if you need scripts to obtain the data from the original documents. Note that the trigram version contains only 680,715 nonzero feature columns.
# of classes: 2
# of data: 350,000
# of features: 16,609,143
Files:
- webspam_wc_normalized_trigram.svm.xz (Tri-gram)
- webspam_wc_normalized_unigram.svm.xz (Uni-gram)
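
The byte n-gram features described above can be sketched in Python as follows. This is an illustration only: the use of overlapping windows and of byte strings as feature keys are assumptions, and the mapping from n-grams to the feature indices used in the files is not reproduced here.

```python
import math
from collections import Counter


def ngram_features(data, n):
    """Count every window of n consecutive bytes, then scale to unit length."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # input shorter than n bytes
    return {gram: c / norm for gram, c in counts.items()}


unigrams = ngram_features(b"abab", 1)
trigrams = ngram_features(b"abab", 3)
```

Word counts are the raw feature values, and dividing by the Euclidean norm gives each instance unit length.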