LIBSVM Data: Classification (Multi-class)
This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets are from UCI, Statlog, StatLib and other collections; we thank the original providers for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) are adjusted accordingly. Some training data are further separated into "training" (tr) and "validation" (val) sets; details can be found in the description of each data set. To read the data via MATLAB, you can use "libsvmread" in the LIBSVM package.
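If you work in Python rather than MATLAB, the short sketch below reads a file in LIBSVM (svmlight) format with scikit-learn; the file name "dna.scale" is only an example and assumes the file has already been downloaded from this page.

```python
# Minimal sketch: read a LIBSVM-format file in Python.
# Assumes scikit-learn is installed and "dna.scale" (any set from this page)
# has been downloaded to the current directory.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("dna.scale")  # X: sparse CSR matrix, y: label vector
print(X.shape, y.shape)                 # (#instances, #features), (#instances,)
```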
aloi
- Source: aloi [AR14a]
- # of classes: 1,000
- # of data: 108,000
- # of features: 128
- Files:
cifar10
- Source: The CIFAR-10 dataset [AK09a]
- Preprocessing: We combine the five training batches of the CIFAR-10 Matlab version from the CIFAR-10 website to produce the training data. For every image, we convert the 32x32 pixels to feature values by rows, in RGB channel order. That is, (row 1, R), (row 2, R), ..., (row 1, G), ... A sketch of this flattening is given after this entry.
- # of classes: 10
- # of data: 50,000 / 10,000 (testing)
- # of features: 3,072
- Files:
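A minimal sketch of the flattening described above, assuming each image is held as a 32x32x3 array in height x width x RGB order; the use of NumPy and the random stand-in image are illustrative assumptions, not part of the original preprocessing.

```python
# Flatten a 32x32 RGB image to 3,072 features in the order
# (row 1, R), (row 2, R), ..., (row 1, G), ...: all rows of R, then G, then B.
import numpy as np

def flatten_rgb_by_rows(img):
    """img: (32, 32, 3) array in H x W x RGB order; returns a (3072,) vector."""
    return np.transpose(img, (2, 0, 1)).reshape(-1)  # channel-first, row-major

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
features = flatten_rgb_by_rows(img)
assert features.shape == (3072,)
```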
connect-4
- Source: UCI / Connect-4
- Preprocessing: We used binary encoding for each original feature (values o, b, x), so the number of features is 42*3 = 126. A sketch of this encoding is given after this entry.
- # of classes: 3
- # of data: 67,557
- # of features: 126
- Files:
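A minimal sketch of this kind of binary (one-hot) encoding, assuming each raw instance is a sequence of 42 board positions taking values o, b, x; the symbol-to-index ordering below is an assumption and may differ from the one used to build the distributed files.

```python
# Illustrative one-hot encoding: 42 positions x 3 symbols = 126 binary features.
# The symbol-to-offset mapping is an assumption.
SYMBOLS = {"o": 0, "b": 1, "x": 2}

def encode_board(board):
    """board: sequence of 42 symbols from {'o', 'b', 'x'}; returns 126 0/1 features."""
    features = [0] * (42 * 3)
    for pos, symbol in enumerate(board):
        features[pos * 3 + SYMBOLS[symbol]] = 1
    return features

print(sum(encode_board(["b"] * 42)))  # one active indicator per position -> 42
```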
covtype
- Source: UCI / Covertype
- # of classes: 7
- # of data: 581,012
- # of features: 54
- Files:
dna
- Source: Statlog / Dna
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 3
- # of data: 2,000 / 1,186 (testing) / 1,400 (tr) / 600 (val)
- # of features: 180
- Files:
glass
- Source: UCI / Glass Identification
- # of classes: 6
- # of data: 214
- # of features: 9
- Files:
iris
- Source: UCI / Iris Plant
- # of classes: 3
- # of data: 150
- # of features: 4
- Files:
LEDGAR (LexGLUE)
- Source: [IC22b]
- Preprocessing: The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 100
- # of data: 60,000 / 10,000 (valid) / 10,000 (testing)
- # of features: 19,996
- Files:
letter
- Source: Statlog / Letter
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 26
- # of data: 15,000 / 5,000 (testing) / 10,500 (tr) / 4,500 (val)
- # of features: 16
- Files:
mnist
- Source: [YL98a]
- Preprocessing: Feature values are stored by rows of each image.
- # of classes: 10
- # of data: 60,000 / 10,000 (testing)
- # of features: 780 / 778 (testing)
- Files:
mnist8m
- Source: Invariant SVM [GL07b]
- # of classes: 10
- # of data: 8,100,000
- # of features: 784
- Files:
news20
- Source: [KL95a]
- Preprocessing: First 80/20 training/testing split. Also see this page. [JR01a]
- # of classes: 20
- # of data: 15,935 / 3,993 (testing)
- # of features: 62,061 / 62,060 (testing)
- Files:
news20 (18,846)
- Source: [KL95a]
- Preprocessing: The data are downloaded from sklearn. We have made sure that the data provided by sklearn are the same as the 18,846 set at this page. In addition, all newlines are replaced with white spaces. The raw data are in the format labels<TAB>texts. We do a random 80/20 split of the whole training set (raw texts only) to generate the validation set. We also provide data with tf-idf features, calculated from the raw texts provided here using TfidfVectorizer from sklearn with default configurations; a sketch of this step is given after this entry. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 20
- # of data: 9,051 / 2,263 (valid) / 7,532 (testing)
- # of features: 130,107
- Files:
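A minimal sketch of the tf-idf step described above, assuming tab-separated label<TAB>text files; the file names below are assumptions, and the actual generation code provided with the set may differ in detail.

```python
# Illustrative tf-idf generation with scikit-learn's default TfidfVectorizer.
# The file names are assumptions; each line is "label<TAB>text".
from sklearn.feature_extraction.text import TfidfVectorizer

def read_raw(path):
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return labels, texts

y_train, train_texts = read_raw("news20.raw.train")  # assumed file name
y_test, test_texts = read_raw("news20.raw.test")     # assumed file name

vectorizer = TfidfVectorizer()                   # default configuration
X_train = vectorizer.fit_transform(train_texts)  # fit vocabulary on training texts
X_test = vectorizer.transform(test_texts)        # reuse the training vocabulary
print(X_train.shape, X_test.shape)
```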
pendigits
- Source: UCI / Pen-Based Recognition of Handwritten Digits Data Set
- # of classes: 10
- # of data: 7,494 / 3,498 (testing)
- # of features: 16
- Files:
poker
- Source: UCI / Poker Hand
- # of classes: 10
- # of data: 25,010 / 1,000,000 (testing)
- # of features: 10
- Files:
protein
- Source: [JYW02a]
- # of classes: 3
- # of data: 17,766 / 6,621 (testing) / 14,895 (training) / 2,871 (validation)
- # of features: 357
- Files:
rcv1.multiclass
- Source: [DL04b]
- Preprocessing: First, the label hierarchy is reorganized by mapping the data set to the second level of the RCV1 topic hierarchy. Documents that only have labels of the third or fourth level are mapped to their parent categories at the second level. Documents that only have labels of the first level are not mapped to any category. Second, we remove multi-labeled instances. [RB08a]
- # of classes: 53
- # of data: 15,564 / 518,571 (testing)
- # of features: 47,236
- Files:
SCOTUS (LexGLUE)
- Source: [IC22b]
- Preprocessing: The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 13
- # of data: 5,000 / 1,400 (validation) / 1,400 (testing)
- # of features: 126,405
- Files:
satimage
- Source: Statlog / Satimage
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 6
- # of data: 4,435 / 2,000 (testing) / 3,104 (tr) / 1,331 (val)
- # of features: 36
- Files:
sector
- Source: [AM98a]
- Preprocessing: The scaled data was used in our KDD 08 paper. For unknown reasons, we can now only generate something close to it. The sources are from this page. We select train-0.tc and test-0.tc from ecoc-svm-data.tar.gz. A 2/1 training/testing split gives the training and testing sets below. They are in the original format instead of the LIBSVM format: in each row, the 2nd value gives the class label and subsequent numbers give pairs of feature IDs and values. We then do a kind of tf-idf transformation, ln(1+tf)*log_2(#docs/#coll_freq_of_term), and normalize each instance to unit length; a sketch of this transformation is given after this entry. [JR01b,SSK08a]
- # of classes: 105
- # of data: 6,412 / 3,207 (testing)
- # of features: 55,197 / 55,197 (testing)
- Files:
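A minimal sketch of the stated weighting, ln(1+tf)*log_2(#docs/#coll_freq_of_term), followed by normalizing each instance to unit length. The real term-count matrices are sparse; the dense NumPy version and the random example below are only illustrative assumptions.

```python
# Illustrative tf-idf-style transformation and unit-length normalization.
import numpy as np

def transform(counts):
    """counts: dense (#docs x #terms) array of raw term frequencies (tf)."""
    n_docs = counts.shape[0]
    coll_freq = counts.sum(axis=0)                      # total count of each term
    idf = np.log2(n_docs / np.maximum(coll_freq, 1.0))  # log_2(#docs / coll_freq)
    weighted = np.log1p(counts) * idf                   # ln(1 + tf) * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.maximum(norms, 1e-12)          # unit-length rows

rng = np.random.default_rng(0)
X = transform(rng.integers(0, 5, size=(6, 10)).astype(float))
print(np.linalg.norm(X, axis=1))                        # ~1.0 per document
```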
segment
- Source: Statlog / Segment
- # of classes: 7
- # of data: 2,310
- # of features: 19
- Files:
Sensorless
- Source: UCI / Dataset for Sensorless Drive Diagnosis
- Preprocessing: The original data do not include test instances. For the [0,1]-scaled version, we provide a random split (.tr and .val) used in our paper. [CCW16a]
- # of classes: 11
- # of data: 58,509
- # of features: 48
- Files:
shuttle
- Source: Statlog / Shuttle
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 7
- # of data: 43,500 / 14,500 (testing) / 30,450 (tr) / 13,050 (val)
- # of features: 9
- Files:
smallNORB
- Source: The Small NORB Dataset [YL04b]
- Preprocessing: Each instance contains a pair of 96x96 grayscale images taken by two cameras, forming two channels. We downsample each channel of the original data from 96x96 to 32x32 by taking the maximum pixel value within every disjoint 3x3 region; a sketch is given after this entry. Feature values are generated by (row 1, channel 1), (row 2, channel 1), ..., (row 1, channel 2), ... [CCW18a]
- # of classes: 5
- # of data: 24,300 / 24,300 (testing)
- # of features: 18,432 / 2,048 (downsampled)
- Files:
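A minimal sketch of the 3x3 max downsampling and row-wise flattening described above for one instance (two 96x96 channels); the NumPy implementation and the random stand-in data are assumptions, not the scripts used to build the distributed files.

```python
# Illustrative 96x96 -> 32x32 downsampling: maximum over each disjoint 3x3 region,
# then concatenate the rows of channel 1 followed by the rows of channel 2.
import numpy as np

def downsample_max_3x3(channel):
    """channel: (96, 96) array. Returns a (32, 32) array of block maxima."""
    blocks = channel.reshape(32, 3, 32, 3)  # split into disjoint 3x3 tiles
    return blocks.max(axis=(1, 3))

def flatten_pair(ch1, ch2):
    """Two (96, 96) channels -> 2,048 features (rows of ch1, then rows of ch2)."""
    return np.concatenate([downsample_max_3x3(ch1).ravel(),
                           downsample_max_3x3(ch2).ravel()])

pair = np.random.randint(0, 256, size=(2, 96, 96))  # stand-in instance
print(flatten_pair(pair[0], pair[1]).shape)          # (2048,)
```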
SVHN
- Source: SVHN [YN11a]
- Preprocessing: We consider format 2 (cropped digits) of the data set. For every image, we convert the 32x32 pixels to feature values by rows, in RGB channel order. That is, (row 1, R), (row 2, R), ..., (row 1, G), ... [YN11a]
- # of classes: 10
- # of data: 73,257 / 26,032 (testing) / 531,131 (extra)
- # of features: 3,072
- Files:
svmguide2
- Source: [CWH03a]
- Preprocessing: Original data: a bioinformatics application from Simon Fraser University, Canada. [JLG03a]
- # of classes: 3
- # of data: 391
- # of features: 20
- Files:
svmguide4
- Source: [CWH03a]
- Preprocessing: Original data: an application on traffic light signals from Georges Bonga at the University of Applied Sciences, Berlin.
- # of classes: 6
- # of data: 300 / 312 (testing)
- # of features: 10
- Files:
usps
- Source: [JJH94a]
- # of classes: 10
- # of data: 7,291 / 2,007 (testing)
- # of features: 256
- Files:
SensIT Vehicle (acoustic)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features using the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets come from a random 80%/20% split of the data. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 50
- Files:
SensIT Vehicle (seismic)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features using the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets come from a random 80%/20% split of the data. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 50
- Files:
SensIT Vehicle (combined)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features using the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets come from a random 80%/20% split of the data. The first 50 features are acoustic, while the rest are seismic. Due to the random selection, the files here are not the direct concatenation of the "SensIT Vehicle (acoustic)" and "SensIT Vehicle (seismic)" sets. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 100
- Files:
vehicle
- Source: Statlog / Vehicle
- # of classes: 4
- # of data: 846
- # of features: 18
- Files:
vowel
- Source: UCI / Vowel
- Preprocessing: The first 528 instances are used for training and the remaining instances for testing. We scale the training data first and adjust the testing data accordingly; a sketch is given after this entry.
- # of classes: 11
- # of data: 528 / 462 (testing)
- # of features: 10
- Files:
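A minimal sketch of scaling the training data first and adjusting the testing data accordingly, as done for most sets on this page; it assumes dense NumPy arrays and linear scaling to [-1,1], similar in spirit to LIBSVM's svm-scale with saved scaling parameters (-s/-r). The random stand-in arrays are assumptions.

```python
# Illustrative per-feature scaling to [-1, 1]: ranges come from the training
# data only and are then applied unchanged to the testing data.
import numpy as np

def fit_scaler(X_train, lower=-1.0, upper=1.0):
    mins, maxs = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return mins, span, lower, upper

def apply_scaler(X, scaler):
    mins, span, lower, upper = scaler
    return lower + (X - mins) * (upper - lower) / span

X_train = np.random.rand(528, 10) * 5.0    # stand-ins for the vowel data
X_test = np.random.rand(462, 10) * 5.0
scaler = fit_scaler(X_train)
X_train_s = apply_scaler(X_train, scaler)
X_test_s = apply_scaler(X_test, scaler)    # may fall slightly outside [-1, 1]
```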
wine
- Source: UCI / Wine Recognition
- # of classes: 3
- # of data: 178
- # of features: 13
- Files: