LIBSVM Data: Regression

This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. For some sets, raw materials (e.g., original texts) are also available. These data sets come from UCI, Statlog, StatLib and other collections, and we thank them for their efforts. For most sets, we linearly scale each attribute to [-1,1] or [0,1]. The testing data (if provided) are adjusted accordingly. Some training data are further split into "training" (tr) and "validation" (val) sets. Details can be found in the description of each data set. To read the data in MATLAB, you can use "libsvmread" from the LIBSVM package.
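
For readers working outside MATLAB, the following is a minimal Python sketch of the two steps described above: parsing the `<label> <index>:<value> ...` text format and linearly scaling each attribute to [-1,1], with the testing data scaled using the ranges found on the training data. The function names (`parse_libsvm_line`, `fit_ranges`, `scale_rows`) are hypothetical and are not part of LIBSVM. As a simplification, only explicitly stored values contribute to each attribute's range, which may differ from what the scaling tool used for the `_scale` files does with implicit zeros.

```python
def parse_libsvm_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def fit_ranges(rows):
    """Collect the (min, max) of every attribute over the training rows."""
    ranges = {}
    for _, features in rows:
        for index, value in features.items():
            lo, hi = ranges.get(index, (value, value))
            ranges[index] = (min(lo, value), max(hi, value))
    return ranges


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute to [lower, upper] using the given ranges."""
    scaled = []
    for label, features in rows:
        out = {}
        for index, value in features.items():
            lo, hi = ranges.get(index, (value, value))
            if hi == lo:
                continue  # constant or unseen attribute: nothing to scale
            out[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
        scaled.append((label, out))
    return scaled


# Made-up example rows (assumption: not taken from any data set on this page).
train = [parse_libsvm_line("15 1:0.5 2:10"), parse_libsvm_line("7 1:1.5 2:30")]
test = [parse_libsvm_line("9 1:1.0 2:40")]

ranges = fit_ranges(train)               # ranges come from training data only
scaled_train = scale_rows(train, ranges)
scaled_test = scale_rows(test, ranges)   # test values may fall outside [-1, 1]
```

Reusing the training ranges for the testing data keeps both sets on the same scale, at the cost that testing values can land outside the target interval.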

abalone

Source: UCI / Abalone
# of data: 4,177
# of features: 8
Files:
- abalone
- abalone_scale (scaled to [-1,1])

bodyfat

Source: StatLib / bodyfat
# of data: 252
# of features: 14
Files:
- bodyfat
- bodyfat_scale (scaled to [-1,1])

cadata

Source: StatLib / houses.zip
# of data: 20,640
# of features: 8
Files:
- cadata

cpusmall

Source: Delve / comp-activ
# of data: 8,192
# of features: 12
Files:
- cpusmall
- cpusmall_scale (scaled to [-1,1])

E2006-log1p

Source: 10-K Corpus
Preprocessing: The data set is obtained from Noah Smith. Features include the volatility in the past 12 months and log-scaled term frequencies of unigrams and bigrams. [SK09a]
# of data: 16,087 / 3,308 (testing)
# of features: 4,272,227
Files:
- log1p.E2006.train.bz2
- log1p.E2006.test.bz2 (testing)
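
As an illustration of what "log-scaled term frequencies of unigrams and bigrams" can look like, here is a small Python sketch. It assumes the transform is log(1 + count), as the file name "log1p" suggests. The tokenization, the bigram joining and the helper names are hypothetical and may not match the preprocessing in [SK09a].

```python
import math
from collections import Counter


def with_bigrams(tokens):
    """Return the unigrams followed by space-joined adjacent-word bigrams."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def log1p_term_frequencies(terms):
    """Map each term to log(1 + raw count). Assumed form of the log scaling."""
    counts = Counter(terms)
    return {term: math.log1p(count) for term, count in counts.items()}


# Toy document (assumption: not taken from the 10-K Corpus).
tokens = ["risk", "factors", "risk"]
features = log1p_term_frequencies(with_bigrams(tokens))
```

The log transform damps the influence of very frequent terms while leaving absent terms at zero, so the feature vectors stay sparse.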

E2006-tfidf

Source: 10-K Corpus
Preprocessing: The data set is obtained from Noah Smith. Features include the volatility in the past 12 months and tf-idf of unigrams. [SK09a]
# of data: 16,087 / 3,308 (testing)
# of features: 150,360
Files:
- E2006.train.bz2
- E2006.test.bz2 (testing)

eunite2001

Source: Eunite 2001 competition
Preprocessing: In 2001, the EUNITE network organized a competition on mid-term load forecasting. Given load and some other information from 1997-1998, the task is to predict the daily maximum load in January 1999. Here we use the transformation of the winner (National Taiwan Univ.), so only winter data (Jan-March and Oct-Dec of 1997 and 1998) are used for training. Features 10-16 are the loads of the previous seven days, scaled by (x-min)/(max-min), where min and max are the smallest and largest loads in 1997-1998. Because the 9th feature isn't used, the effective number of features is 15. The first column of eunite.t is the real load to be predicted. Features 10-15 in eunite.t are generated from the predicted result of the previous day because true values were not available during the competition. You can use the sample file eunite.m below to evaluate the prediction results. You need the LIBSVM MATLAB interface, and the data must be copied to the same directory. [BJC02a]
# of data: 336 / 31 (testing)
# of features: 16
Files:
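
The preprocessing described for eunite2001 combines min-max scaling of lagged loads with day-by-day forecasting, where each prediction is fed back in as a lag value for the following days. The Python sketch below illustrates that idea only. The function names, the ordering of the lag features (most recent day first) and the `predict` callback are assumptions, and this is not the eunite.m script or the winner's code.

```python
def min_max_scale(load, load_min, load_max):
    """Scale a load by (x - min) / (max - min)."""
    return (load - load_min) / (load_max - load_min)


def lag_features(history, load_min, load_max, n_lags=7):
    """Scaled loads of the previous n_lags days, most recent first (assumed order)."""
    recent = history[-n_lags:][::-1]
    return [min_max_scale(x, load_min, load_max) for x in recent]


def recursive_forecast(history, predict, load_min, load_max, horizon):
    """Forecast `horizon` days, feeding each prediction back in as a lag value."""
    series = list(history)
    forecasts = []
    for _ in range(horizon):
        features = lag_features(series, load_min, load_max)
        y = predict(features)   # hypothetical model: scaled lags -> load in original units
        forecasts.append(y)
        series.append(y)        # the true load is unknown, so reuse the prediction
    return forecasts


# Toy usage with made-up loads and a naive "same as yesterday" model.
load_min, load_max = 0.0, 1000.0
history = [100, 200, 300, 400, 500, 600, 700, 800]


def persistence(features):
    # Undo the scaling of the most recent lag to get a load in original units.
    return load_min + features[0] * (load_max - load_min)


forecasts = recursive_forecast(history, persistence, load_min, load_max, horizon=3)
```

Because later forecasts depend on earlier ones, prediction errors can accumulate over the forecasting horizon.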

housing

Source: UCI / Housing (Boston)
# of data: 506
# of features: 13
Files:
- housing
- housing_scale (scaled to [-1,1])

mg

Source: [GWF01a]
# of data: 1,385
# of features: 6
Files:
- mg
- mg_scale (scaled to [-1,1])

mpg

Source: UCI / Auto-Mpg
# of data: 392
# of features: 7
Files:
- mpg
- mpg_scale (scaled to [-1,1])

pyrim

Source: UCI / Qualitative Structure Activity Relationships
# of data: 74
# of features: 27
Files:
- pyrim
- pyrim_scale (scaled to [-1,1])

space_ga

Source: StatLib / space_ga
# of data: 3,107
# of features: 6
Files:
- space_ga
- space_ga_scale (scaled to [-1,1])

triazines

Source: UCI / Qualitative Structure Activity Relationships
# of data: 186
# of features: 60
Files:
- triazines
- triazines_scale (scaled to [-1,1])

YearPredictionMSD

Source: UCI / YearPredictionMSD
# of data: 463,715 / 51,630 (testing)
# of features: 90
Files:
- YearPredictionMSD.bz2
- YearPredictionMSD.t.bz2 (testing)