LIBSVM Data: Classification (Multi-class)
This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets are from UCI, Statlog, StatLib and other collections; we thank the original providers for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) are adjusted accordingly. Some training data are further separated into "training" (tr) and "validation" (val) sets; details can be found in the description of each data set. To read the data via MATLAB, you can use "libsvmread" in the LIBSVM package.
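If you work in Python rather than MATLAB, the short sketch below reads a file in LIBSVM (svmlight) format with scikit-learn; the file name "dna.scale" is only an example and assumes the file has already been downloaded from this page.

```python
# Minimal sketch: read a LIBSVM-format file in Python.
# Assumes scikit-learn is installed and "dna.scale" (any set from this page)
# has been downloaded to the current directory.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("dna.scale")  # X: sparse CSR matrix, y: label vector
print(X.shape, y.shape)                 # (#instances, #features), (#instances,)
```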
aloi
- Source: aloi [AR14a]
- # of classes: 1,000
- # of data: 108,000
- # of features: 128
- Files:
cifar10
- Source: The CIFAR-10 dataset [AK09a]
- Preprocessing: We combine the five training batches of the CIFAR-10 Matlab version from the CIFAR-10 website to produce the training data. For every image, we convert the 32x32 pixels to feature values by rows, in RGB channel order. That is, (row 1, R), (row 2, R), ..., (row 1, G), ... A sketch of this flattening is given after this entry.
- # of classes: 10
- # of data: 50,000 / 10,000 (testing)
- # of features: 3,072
- Files:
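A minimal sketch of the flattening described above, assuming each image is held as a 32x32x3 array in height x width x RGB order; the use of NumPy and the random stand-in image are illustrative assumptions, not part of the original preprocessing.

```python
# Flatten a 32x32 RGB image to 3,072 features in the order
# (row 1, R), (row 2, R), ..., (row 1, G), ...: all rows of R, then G, then B.
import numpy as np

def flatten_rgb_by_rows(img):
    """img: (32, 32, 3) array in H x W x RGB order; returns a (3072,) vector."""
    return np.transpose(img, (2, 0, 1)).reshape(-1)  # channel-first, row-major

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
features = flatten_rgb_by_rows(img)
assert features.shape == (3072,)
```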
connect-4
- Source: UCI / Connect-4
- Preprocessing: We used binary encoding for each original feature (values o, b, x), so the number of features is 42*3 = 126. A sketch of this encoding is given after this entry.
- # of classes: 3
- # of data: 67,557
- # of features: 126
- Files:
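A minimal sketch of this kind of binary (one-hot) encoding, assuming each raw instance is a sequence of 42 board positions taking values o, b, x; the symbol-to-index ordering below is an assumption and may differ from the one used to build the distributed files.

```python
# Illustrative one-hot encoding: 42 positions x 3 symbols = 126 binary features.
# The symbol-to-offset mapping is an assumption.
SYMBOLS = {"o": 0, "b": 1, "x": 2}

def encode_board(board):
    """board: sequence of 42 symbols from {'o', 'b', 'x'}; returns 126 0/1 features."""
    features = [0] * (42 * 3)
    for pos, symbol in enumerate(board):
        features[pos * 3 + SYMBOLS[symbol]] = 1
    return features

print(sum(encode_board(["b"] * 42)))  # one active indicator per position -> 42
```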
covtype
- Source: UCI / Covertype
- # of classes: 7
- # of data: 581,012
- # of features: 54
- Files:
dna
- Source: Statlog / Dna
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 3
- # of data: 2,000 / 1,186 (testing) / 1,400 (tr) / 600 (val)
- # of features: 180
- Files:
glass
- Source: UCI / Glass Identification
- # of classes: 6
- # of data: 214
- # of features: 9
- Files:
iris
- Source: UCI / Iris Plant
- # of classes: 3
- # of data: 150
- # of features: 4
- Files:
LEDGAR (LexGLUE)
- Source: [IC22b]
- Preprocessing: The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 100
- # of data: 60,000 / 10,000 (valid) / 10,000 (testing)
- # of features: 19,996
- Files:
letter
- Source: Statlog / Letter
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 26
- # of data: 15,000 / 5,000 (testing) / 10,500 (tr) / 4,500 (val)
- # of features: 16
- Files:
mnist
- Source: [YL98a]
- Preprocessing: Feature values are stored by rows of each image.
- # of classes: 10
- # of data: 60,000 / 10,000 (testing)
- # of features: 780 / 778 (testing)
- Files:
mnist8m
- Source: Invariant SVM [GL07b]
- # of classes: 10
- # of data: 8,100,000
- # of features: 784
- Files:
news20
- Source: [KL95a]
- Preprocessing: First 80/20 training/testing split. Also see this page. [JR01a]
- # of classes: 20
- # of data: 15,935 / 3,993 (testing)
- # of features: 62,061 / 62,060 (testing)
- Files:
news20 (18,846)
- Source: [KL95a]
- Preprocessing: The data are downloaded from sklearn. We have made sure that the data provided by sklearn are the same as the 18,846 set at this page. In addition, all newlines are replaced with white spaces. The raw data are in the format labels<TAB>texts. We do a random 80/20 split of the whole training set (raw texts only) to generate the validation set. We also provide data with tf-idf features, calculated from the raw texts provided here using TfidfVectorizer from sklearn with default configurations; a sketch of this step is given after this entry. The code used to generate the raw texts and tf-idf features is provided.
- # of classes: 20
- # of data: 9,051 / 2,263 (valid) / 7,532 (testing)
- # of features: 130,107
- Files:
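A minimal sketch of the tf-idf step described above, assuming tab-separated label<TAB>text files; the file names below are assumptions, and the actual generation code provided with the set may differ in detail.

```python
# Illustrative tf-idf generation with scikit-learn's default TfidfVectorizer.
# The file names are assumptions; each line is "label<TAB>text".
from sklearn.feature_extraction.text import TfidfVectorizer

def read_raw(path):
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return labels, texts

y_train, train_texts = read_raw("news20.raw.train")  # assumed file name
y_test, test_texts = read_raw("news20.raw.test")     # assumed file name

vectorizer = TfidfVectorizer()                   # default configuration
X_train = vectorizer.fit_transform(train_texts)  # fit vocabulary on training texts
X_test = vectorizer.transform(test_texts)        # reuse the training vocabulary
print(X_train.shape, X_test.shape)
```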
pendigits
- Source: UCI / Pen-Based Recognition of Handwritten Digits Data Set
- # of classes: 10
- # of data: 7,494 / 3,498 (testing)
- # of features: 16
- Files:
poker
- Source: UCI / Poker Hand
- # of classes: 10
- # of data: 25,010 / 1,000,000 (testing)
- # of features: 10
- Files:
protein
- Source: [JYW02a]
- # of classes: 3
- # of data: 17,766 / 6,621 (testing) / 14,895 (training) / 2,871 (validation)
- # of features: 357
- Files:
rcv1.multiclass
- Source: [DL04b]
- Preprocessing: First, the label hierarchy is reorganized by mapping the data set to the second level of the RCV1 topic hierarchy. Documents that only have labels of the third or fourth level are mapped to their parent categories at the second level. Documents that only have labels of the first level are not mapped to any category. Second, we remove multi-labeled instances. [RB08a]
- # of classes: 53
- # of data: 15,564 / 518,571 (testing)
- # of features: 47,236
- Files:
SCOTUS (LexGLUE)
- Source: [IC22b]
- Preprocessing: The procedure is the same as that for ECtHR (A) (LexGLUE).
- # of classes: 13
- # of data: 5,000 / 1,400 (validation) / 1,400 (testing)
- # of features: 126,405
- Files:
satimage
- Source: Statlog / Satimage
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 6
- # of data: 4,435 / 2,000 (testing) / 3,104 (tr) / 1,331 (val)
- # of features: 36
- Files:
sector
- Source: [AM98a]
- Preprocessing: The scaled data was used in our KDD 08 paper. For unknown reasons, we can now only generate something close to it. The sources are from this page. We select train-0.tc and test-0.tc from ecoc-svm-data.tar.gz. A 2/1 training/testing split gives the training and testing sets below. They are in the original format instead of the LIBSVM format: in each row, the 2nd value gives the class label and subsequent numbers give pairs of feature IDs and values. We then do a kind of tf-idf transformation, ln(1+tf)*log_2(#docs/#coll_freq_of_term), and normalize each instance to unit length; a sketch of this transformation is given after this entry. [JR01b,SSK08a]
- # of classes: 105
- # of data: 6,412 / 3,207 (testing)
- # of features: 55,197 / 55,197 (testing)
- Files:
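A minimal sketch of the stated weighting, ln(1+tf)*log_2(#docs/#coll_freq_of_term), followed by normalizing each instance to unit length. The real term-count matrices are sparse; the dense NumPy version and the random example below are only illustrative assumptions.

```python
# Illustrative tf-idf-style transformation and unit-length normalization.
import numpy as np

def transform(counts):
    """counts: dense (#docs x #terms) array of raw term frequencies (tf)."""
    n_docs = counts.shape[0]
    coll_freq = counts.sum(axis=0)                      # total count of each term
    idf = np.log2(n_docs / np.maximum(coll_freq, 1.0))  # log_2(#docs / coll_freq)
    weighted = np.log1p(counts) * idf                   # ln(1 + tf) * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.maximum(norms, 1e-12)          # unit-length rows

rng = np.random.default_rng(0)
X = transform(rng.integers(0, 5, size=(6, 10)).astype(float))
print(np.linalg.norm(X, axis=1))                        # ~1.0 per document
```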
segment
- Source: Statlog / Segment
- # of classes: 7
- # of data: 2,310
- # of features: 19
- Files:
Sensorless
- Source: UCI / Dataset for Sensorless Drive Diagnosis
- Preprocessing: The original data do not include test instances. For the [0,1]-scaled version, we provide a random split (.tr and .val) used in our paper. [CCW16a]
- # of classes: 11
- # of data: 58,509
- # of features: 48
- Files:
shuttle
- Source: Statlog / Shuttle
- Preprocessing: Training data is further separated into two sets, tr and val. [CWH01a]
- # of classes: 7
- # of data: 43,500 / 14,500 (testing) / 30,450 (tr) / 13,050 (val)
- # of features: 9
- Files:
smallNORB
- Source: The Small NORB Dataset [YL04b]
- Preprocessing: Each instance contains a pair of 96x96 grayscale images taken by two cameras, forming two channels. We downsample each channel of the original data from 96x96 to 32x32 by taking the maximum pixel value within every disjoint 3x3 region; a sketch is given after this entry. Feature values are generated by (row 1, channel 1), (row 2, channel 1), ..., (row 1, channel 2), ... [CCW18a]
- # of classes: 5
- # of data: 24,300 / 24,300 (testing)
- # of features: 18,432 / 2,048 (downsampled)
- Files:
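A minimal sketch of the 3x3 max downsampling and row-wise flattening described above for one instance (two 96x96 channels); the NumPy implementation and the random stand-in data are assumptions, not the scripts used to build the distributed files.

```python
# Illustrative 96x96 -> 32x32 downsampling: maximum over each disjoint 3x3 region,
# then concatenate the rows of channel 1 followed by the rows of channel 2.
import numpy as np

def downsample_max_3x3(channel):
    """channel: (96, 96) array. Returns a (32, 32) array of block maxima."""
    blocks = channel.reshape(32, 3, 32, 3)  # split into disjoint 3x3 tiles
    return blocks.max(axis=(1, 3))

def flatten_pair(ch1, ch2):
    """Two (96, 96) channels -> 2,048 features (rows of ch1, then rows of ch2)."""
    return np.concatenate([downsample_max_3x3(ch1).ravel(),
                           downsample_max_3x3(ch2).ravel()])

pair = np.random.randint(0, 256, size=(2, 96, 96))  # stand-in instance
print(flatten_pair(pair[0], pair[1]).shape)          # (2048,)
```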
SVHN
- Source: SVHN [YN11a]
- Preprocessing: We consider format 2 (cropped digits) of the data set. For every image, we convert the 32x32 pixels to feature values by rows, in RGB channel order. That is, (row 1, R), (row 2, R), ..., (row 1, G), ... [YN11a]
- # of classes: 10
- # of data: 73,257 / 26,032 (testing) / 531,131 (extra)
- # of features: 3,072
- Files:
svmguide2
- Source: [CWH03a]
- Preprocessing: Original data: a bioinformatics application from Simon Fraser University, Canada. [JLG03a]
- # of classes: 3
- # of data: 391
- # of features: 20
- Files:
svmguide4
- Source: [CWH03a]
- Preprocessing: Original data: an application on traffic light signals from Georges Bonga at the University of Applied Sciences, Berlin.
- # of classes: 6
- # of data: 300 / 312 (testing)
- # of features: 10
- Files:
usps
- Source: [JJH94a]
- # of classes: 10
- # of data: 7,291 / 2,007 (testing)
- # of features: 256
- Files:
SensIT Vehicle (acoustic)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features using the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets come from a random 80%/20% split of the data. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 50
- Files:
SensIT Vehicle (seismic)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features using the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets come from a random 80%/20% split of the data. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 50
- Files:
SensIT Vehicle (combined)
- Source: Sensit [MD04a]
- Preprocessing: We regenerate features using the authors' MATLAB scripts (see Sec. C of Appendix A), then randomly select 10% of the instances from the noise class so that the class proportion is 1:1:2 (AAV:DW:noise). The training/testing sets come from a random 80%/20% split of the data. The first 50 features are acoustic, while the rest are seismic. Due to the random selection, the files here are not the direct concatenation of the "SensIT Vehicle (acoustic)" and "SensIT Vehicle (seismic)" sets. [MD04a]
- # of classes: 3
- # of data: 78,823 / 19,705 (testing)
- # of features: 100
- Files:
vehicle
- Source: Statlog / Vehicle
- # of classes: 4
- # of data: 846
- # of features: 18
- Files:
vowel
- Source: UCI / Vowel
- Preprocessing: The first 528 instances are used for training and the remaining instances for testing. We scale the training data first and adjust the testing data accordingly; a sketch is given after this entry.
- # of classes: 11
- # of data: 528 / 462 (testing)
- # of features: 10
- Files:
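A minimal sketch of scaling the training data first and adjusting the testing data accordingly, as done for most sets on this page; it assumes dense NumPy arrays and linear scaling to [-1,1], similar in spirit to LIBSVM's svm-scale with saved scaling parameters (-s/-r). The random stand-in arrays are assumptions.

```python
# Illustrative per-feature scaling to [-1, 1]: ranges come from the training
# data only and are then applied unchanged to the testing data.
import numpy as np

def fit_scaler(X_train, lower=-1.0, upper=1.0):
    mins, maxs = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return mins, span, lower, upper

def apply_scaler(X, scaler):
    mins, span, lower, upper = scaler
    return lower + (X - mins) * (upper - lower) / span

X_train = np.random.rand(528, 10) * 5.0    # stand-ins for the vowel data
X_test = np.random.rand(462, 10) * 5.0
scaler = fit_scaler(X_train)
X_train_s = apply_scaler(X_train, scaler)
X_test_s = apply_scaler(X_test, scaler)    # may fall slightly outside [-1, 1]
```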
wine
- Source: UCI / Wine Recognition
- # of classes: 3
- # of data: 178
- # of features: 13
- Files: