LIBSVM Data: Regression
This page contains many classification, regression,
multi-label and string data sets stored in LIBSVM format. For some sets,
raw materials (e.g., original texts) are also available. These data sets
are from UCI, Statlog, StatLib and other collections. We
thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided)
are adjusted accordingly. Some training data are further split
into "training" (tr) and "validation" (val) sets. Details can be
found in the description of each data set. To read data via MATLAB, you can use "libsvmread" in the LIBSVM package.
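For readers not using MATLAB, the following is a minimal Python sketch of the two steps described above: reading lines in LIBSVM format (`label index:value ...`) and linearly scaling each attribute to a target range, with the testing data scaled by the ranges found on the training data. The function names (`parse_libsvm_line`, `fit_ranges`, `scale_row`) are hypothetical and not part of LIBSVM. Treating absent (sparse) entries as 0 when computing ranges is an assumption here, not a statement of how the files on this page were produced.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def fit_ranges(feature_rows):
    """Return {index: (min, max)} over the training rows.

    Assumption: a feature that is absent from a row has value 0.
    """
    indices = set()
    for row in feature_rows:
        indices.update(row)
    ranges = {}
    for idx in sorted(indices):
        values = [row.get(idx, 0.0) for row in feature_rows]
        ranges[idx] = (min(values), max(values))
    return ranges


def scale_row(features, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature from its training range to [lower, upper].

    Constant features (min == max) are skipped. Values in testing data may
    fall outside [lower, upper] because the training ranges are reused.
    """
    scaled = {}
    for idx, (lo, hi) in ranges.items():
        if hi == lo:
            continue
        x = features.get(idx, 0.0)
        scaled[idx] = lower + (upper - lower) * (x - lo) / (hi - lo)
    return scaled


# Tiny made-up example (not taken from any data set on this page).
train_lines = ["1.5 1:2 3:10", "0.5 1:4 2:1 3:20"]
train = [parse_libsvm_line(line) for line in train_lines]
ranges = fit_ranges([features for _, features in train])
train_scaled = [scale_row(features, ranges) for _, features in train]

# The testing row is scaled with the ranges learned from the training rows.
_, test_features = parse_libsvm_line("0.0 1:3 3:25")
test_scaled = scale_row(test_features, ranges)
```

Reusing the training ranges for the testing data keeps both sets on the same scale, which is why a testing value can land outside the target interval.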
abalone
- Source:
UCI
/ Abalone
- # of data:
4,177
- # of features:
8
- Files:
bodyfat
- Source:
StatLib
/ bodyfat
- # of data:
252
- # of features:
14
- Files:
cadata
- Source:
StatLib
/ houses.zip
- # of data:
20,640
- # of features:
8
- Files:
cpusmall
- Source:
Delve
/ comp-activ
- # of data:
8,192
- # of features:
12
- Files:
E2006-log1p
- Source:
10-K Corpus
- Preprocessing:
The data set is obtained from
Noah Smith. Features include the volatility in the past 12 months and log-scaled term frequencies of unigrams and bigrams.
[SK09a]
- # of data:
16,087
/ 3,308 (testing)
- # of features:
4,272,227
- Files:
E2006-tfidf
- Source:
10-K Corpus
- Preprocessing:
The data set is obtained from
Noah Smith. Features include the volatility in the past 12 months and tf-idf of unigrams.
[SK09a]
- # of data:
16,087
/ 3,308 (testing)
- # of features:
150,360
- Files:
eunite2001
- Source:
/ Eunite 2001 competition
- Preprocessing:
In 2001, the EUNITE network organized a competition on
mid-term load forecasting. Given load and some other information from 1997-1998,
the task is to predict the daily maximum load
in January 1999.
Here we use the transformation of the winning entry (National Taiwan Univ), so
only winter (Jan-March and Oct-Dec of 1997
and 1998) data are used for training.
Features 10-16 are
the loads of the previous seven days, scaled by (x-min)/(max-min),
where min and max are the smallest and largest loads in 1997-1998.
Because the 9th feature isn't used, the total number of features is actually 15.
The first column of eunite.t is the real
load to be predicted. Features 10-15 in eunite.t are generated from
the "predicted result of the previous day" because true values
were not available during the competition.
You can use the sample file
eunite.m below to evaluate the prediction results.
You need to use the LIBSVM
MATLAB interface and copy the data to the
same directory.
[BJC02a]
- # of data:
336
/ 31 (testing)
- # of features:
16
- Files:
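The eunite2001 preprocessing note above describes two ideas: lagged loads scaled by (x-min)/(max-min), and testing features built from earlier predictions because the true loads were unknown. A rough Python sketch of that recursive scheme follows. The names `scaled_lags` and `recursive_forecast` and the `predict` callable are hypothetical, the oldest-first ordering of the lags is an assumption, and the sketch assumes `predict` returns a load in original units. It is not the winning entry's code and does not reproduce eunite.m.

```python
def scaled_lags(series, lo, hi, n_lags=7):
    """Return the last n_lags loads, scaled by (x - min) / (max - min).

    Assumption: lags are ordered oldest first.
    """
    if len(series) < n_lags:
        raise ValueError("not enough history for the requested lags")
    return [(x - lo) / (hi - lo) for x in series[-n_lags:]]


def recursive_forecast(history, n_days, predict, lo, hi, n_lags=7):
    """Forecast n_days ahead, feeding each prediction back in as a lag.

    `predict` is a hypothetical callable that maps a list of scaled lags
    to a load in original units (for example, a trained regression model).
    """
    series = list(history)
    forecasts = []
    for _ in range(n_days):
        lags = scaled_lags(series, lo, hi, n_lags)
        y = predict(lags)
        forecasts.append(y)
        series.append(y)  # the true value is unknown, so reuse the prediction
    return forecasts


# Made-up numbers with a toy predictor that returns the older lag unscaled.
history = [0.0, 100.0, 50.0]
lo, hi = min(history), max(history)
forecasts = recursive_forecast(
    history, n_days=3, predict=lambda lags: 100.0 * lags[0],
    lo=lo, hi=hi, n_lags=2,
)
```

Because each forecast becomes an input for the next day, errors can accumulate over the forecast horizon.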
housing
- Source:
UCI
/ Housing (Boston)
- # of data:
506
- # of features:
13
- Files:
mg
- Source:
[GWF01a]
- # of data:
1,385
- # of features:
6
- Files:
mpg
- Source:
UCI
/ Auto-Mpg
- # of data:
392
- # of features:
7
- Files:
pyrim
- Source:
UCI
/ Qualitative Structure Activity Relationships
- # of data:
74
- # of features:
27
- Files:
space_ga
- Source:
StatLib
/ space_ga
- # of data:
3,107
- # of features:
6
- Files:
triazines
- Source:
UCI
/ Qualitative Structure Activity Relationships
- # of data:
186
- # of features:
60
- Files:
YearPredictionMSD
- Source:
UCI
/ YearPredictionMSD
- # of data:
463,715
/ 51,630 (testing)
- # of features:
90
- Files: