Cross Validation using Higher-level Information to Split Data

A MATLAB/Octave Code

We provide this MATLAB/Octave code. To use it, please

  1. Store it in the same directory as the LIBSVM MATLAB/Octave interface.
  2. Have the LIBSVM MATLAB/Octave interface built and ready to use.

The usage is

metalevel_cv(label_vector, instance_matrix, size_of_groups [, 'libsvm_options']);
Assume you have the following data:
users labels features
1      1     0.5 1 -0.4
1      1     ...
1      1     ...
2     -1     ...
2     -1     ...
3      1     ...
3      1     ...
3      1     ...
4     -1     ...
4     -1     ...
4     -1     ...
4     -1     ...
...
Then the variable size_of_groups is the vector
3 2 3 4 ...
which gives the number of instances of each user, in the order the users appear in the data. You can download the example file heart_scale.mat and run the following commands
>> load heart_scale.mat
>> who

Your variables are:

logindex  train_x   train_y   

>> metalevel_cv(train_y, train_x, logindex, '-v 3 -e 0.01');
to conduct three-fold CV. The '-e 0.01' option is passed through to LIBSVM. Note that instances of the same group are assumed to be stored in consecutive rows of the variable train_x.
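If your data comes with a per-instance user column rather than a ready-made size_of_groups vector, the vector can be derived by counting runs of equal IDs. The following Python sketch illustrates the idea; the function name group_sizes is hypothetical and not part of the provided code. It also rejects data in which a user's instances are not stored together.

```python
from itertools import groupby

def group_sizes(user_ids):
    """Return the length of each run of consecutive equal IDs.

    Hypothetical helper: it mirrors what size_of_groups is described
    to contain, but it is not part of metalevel_cv.m.
    """
    runs = [(key, len(list(run))) for key, run in groupby(user_ids)]
    ids = [key for key, _ in runs]
    # An ID appearing in two separate runs means the instances of that
    # group are not contiguous, which the CV code assumes they are.
    if len(set(ids)) != len(ids):
        raise ValueError("instances of the same group are not contiguous")
    return [size for _, size in runs]

users = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
print(group_sizes(users))  # prints [3, 2, 3, 4]
```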

The default number of CV folds is five. Currently, the split does not take the size of each group into consideration, so the folds may contain quite different numbers of instances.
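To make the idea of splitting by group concrete, here is a minimal Python sketch of one possible scheme: shuffle the groups with a fixed seed, deal them to folds in round-robin order, and give every instance the fold of its group. This is an assumption for illustration and not necessarily how metalevel_cv.m assigns folds; the name assign_folds is hypothetical. Like the provided code, it ignores group sizes, so folds can differ in instance count.

```python
import random

def assign_folds(size_of_groups, n_folds=5, seed=1):
    """Return one fold index per instance, keeping each group in one fold.

    Illustrative sketch only; the actual MATLAB/Octave code may differ.
    """
    rng = random.Random(seed)              # fixed seed: same split on every run
    order = list(range(len(size_of_groups)))
    rng.shuffle(order)                     # random order of groups, not instances

    group_fold = [0] * len(size_of_groups)
    for position, group in enumerate(order):
        group_fold[group] = position % n_folds   # round-robin over shuffled groups

    # Expand the per-group fold to a per-instance fold.
    instance_fold = []
    for group, size in enumerate(size_of_groups):
        instance_fold.extend([group_fold[group]] * size)
    return instance_fold

folds = assign_folds([3, 2, 3, 4], n_folds=2)
print(folds)
```

Because whole groups move together, no user contributes instances to both the training and the validation part of any fold.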

To ensure that the same split of data is used throughout parameter selection, we fix the random seed internally. To seed the random number generator from the current time instead, comment out the following line in the file.

rand('seed', 1);

LIBLINEAR

The above MATLAB/Octave code can also be used with LIBLINEAR, but you must replace the calls to svmtrain and svmpredict of LIBSVM with train and predict of LIBLINEAR.

Parameter Selection

To select parameters with grid.py of LIBSVM, we provide a Python script metalevel_cv.py that acts like svm-train of LIBSVM. For example, grid.py calls

> ./svm-train -v 5 -c 4 -g 0.25 your_data
Cross Validation Accuracy = 90.1122%
when trying one parameter set. We replace svm-train with this Python script:
> ./metalevel_cv.py -v 5 -c 4 -g 0.25 your_data
Cross Validation Accuracy = 84.1234%
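The replacement works because the script prints its result in the same format as svm-train, so the caller can read the accuracy from the output line shown above. The following Python sketch shows how such a line could be parsed; parse_cv_accuracy is a hypothetical helper and is not claimed to be the parsing code used inside grid.py.

```python
import re

def parse_cv_accuracy(output):
    """Extract the percentage from a 'Cross Validation Accuracy = X%' line.

    Hypothetical helper for illustration; grid.py has its own parser.
    """
    match = re.search(r"Cross Validation Accuracy = ([0-9.]+)%", output)
    if match is None:
        raise ValueError("no cross validation accuracy line found")
    return float(match.group(1))

print(parse_cv_accuracy("Cross Validation Accuracy = 84.1234%\n"))  # prints 84.1234
```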

Internally, metalevel_cv.py runs metalevel_cv.m through Octave. We use Octave rather than MATLAB because grid.py launches many CV jobs in parallel and you may not have enough MATLAB licenses.

Because the data must contain meta information, we make the following requirements:

  1. The data is in a format readable by Octave, for example a .mat file generated by MATLAB or Octave.
  2. The data contains three variables train_y, train_x, and logindex to store labels, features, and meta information, respectively.
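Before starting a long grid search, it can help to confirm that the three variables are consistent with each other. The sketch below assumes the variables have already been loaded into Python sequences and that logindex holds group sizes, as in the heart_scale example; the function name check_metalevel_data is hypothetical.

```python
def check_metalevel_data(train_y, train_x, logindex):
    """Raise ValueError if labels, features, and group sizes disagree.

    Hypothetical sanity check; it is not part of the provided scripts.
    """
    n_instances = len(train_x)
    if len(train_y) != n_instances:
        raise ValueError("train_y and train_x have different numbers of rows")
    if any(size <= 0 for size in logindex):
        raise ValueError("every group must contain at least one instance")
    # The group sizes must account for every instance exactly once.
    if sum(logindex) != n_instances:
        raise ValueError("sum of logindex does not match the number of instances")
    return True

# Toy data: 5 instances split into groups of 3 and 2.
labels = [1, 1, 1, -1, -1]
features = [[0.5, 1.0, -0.4]] * 5
print(check_metalevel_data(labels, features, [3, 2]))  # prints True
```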
Detailed procedure:
  1. Have Octave and the LIBSVM Octave interface ready on your system.
  2. Download this Python script and put it in the directory of the MATLAB/Octave interface.
  3. When calling grid.py for parameter selection, specify this Python script as the LIBSVM training executable.
Example:
> pwd
/home/you/data_directory
> ls ../libsvm-3.17/tools/grid.py
../libsvm-3.17/tools/grid.py
> octave
octave> load your_data.mat
octave> who
train_y train_x logindex
octave> quit
> ../libsvm-3.17/tools/grid.py -log2c -2,3,2 -log2g -3,1,2 -m 500 -v 3 -svmtrain ../libsvm-3.17/matlab/metalevel_cv.py your_data.mat

Please contact Chih-Jen Lin with any questions.