This is the README for reproducing our experiments in the paper,
"A Unified Algorithm for One-class Structured Matrix Factorization with Side Information".

Table of Contents
=================

- Prerequisites

- Multi-Label Problems
   - Part 1. Data Preparation
   - Part 2. Experiments Reproduction
   - Part 3. Train-Validation Split
   - Part 4. Grid Search
   - Part 5. Cleaner Grid Search Result

- One-class MF
   - Part 1. Data Preparation
   - Part 2. Experiments Reproduction
   - Part 3. Train-Validation Split
   - Part 4. Grid Search
   - Part 5. Cleaner Grid Search Result


Prerequisites
=============

- Matlab
- Python2
- Standard gcc/g++ (clang will not work)

Before doing anything, please cd to unified-oneclass/matlab/ inside Matlab.
Then type "make" in that directory under Matlab to create two mex files
(imf_train and grmf_train).

With a standard gcc/g++ compiler, the "make" command should work right away.
However, since we use OpenMP, some non-standard compilers may fail
(for example, the clang compiler used on Mac OS).
To build on Mac OS, you have to install a standard gcc/g++ and change the default compiler used by Matlab.

After creating these two mex files, please move them to the current directory, oc_exp_code/.
Then everything below that requires Matlab can be done in this directory.
(Remember to cd back to this directory in Matlab.)


==================================
=      Multi-Label Problems      =
==================================
   
Multi-Label Problems: Part 1. Data Preparation
==============================================

To obtain the five data sets used in our experiments ("bibtex", "delicious", "mediamill", "eurlex" and "wiki10"),
please run "python download.py" under the data/ directory (you need to cd to data/ first).
It will automatically download the five data sets as five .mat files in data/.
(It will also download the data sets used in the One-class MF experiments.)

If you only want some particular data sets,
please download them from http://www.csie.ntu.edu.tw/~cjlin/papers/ocmf-side/data/
and manually put them under the data/ directory.

Now you can proceed to Part 2.

| How to prepare your own data set?
| 
| The .mat file contains a single Matlab structure named "data",
| which has four fields: X (the feature matrix), Y (the label matrix), and Xt, Yt (the testing set).
| 
| X is an m x d sparse matrix, Y is an m x n sparse matrix (with only 1s and 0s),
| Xt is an m_t x d sparse matrix and Yt is an m_t x n sparse matrix,
| where m is the # of training instances, m_t is the # of testing instances,
|       d is the # of features and n is the # of labels.
| 
| Thus, by creating a Matlab structure of the same form, your data preparation is done.
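As a sketch, the same layout can also be produced from Python with scipy (this is our own illustration, not a script shipped with this code; the file name "mydata.mat" and all sizes are made up):

```python
# Build a tiny synthetic multi-label data set in the layout described
# above and save it as a .mat file containing a structure named "data".
import numpy as np
import scipy.sparse as sp
import scipy.io

m, m_t, d, n = 100, 25, 50, 10        # train/test instances, features, labels

X  = sp.random(m,   d, density=0.10, format='csr', random_state=0)
Xt = sp.random(m_t, d, density=0.10, format='csr', random_state=1)
Y  = sp.random(m,   n, density=0.05, format='csr', random_state=2)
Yt = sp.random(m_t, n, density=0.05, format='csr', random_state=3)
Y.data[:]  = 1.0                      # the label matrices hold only 1s and 0s
Yt.data[:] = 1.0

# scipy.io.savemat turns a Python dict into a Matlab struct
data = {'X': X, 'Y': Y, 'Xt': Xt, 'Yt': Yt}
scipy.io.savemat('mydata.mat', {'data': data})
```

Loading mydata.mat in Matlab then yields a structure "data" with the four fields above.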


Multi-Label Problems: Part 2. Experiments Reproduction
======================================================

This part shows how to reproduce the experimental results shown in our paper.

Since these problems require some hyper-parameter tuning, we applied grid search to find the best parameters.
To run the grid search yourself, please refer to Part 3.

In this part, the parameters found by grid search are already plugged in.
Thus, by running the following Matlab scripts, you can reproduce the numbers shown in our paper.
(note that 16 cores are used by default)
(note that for methods involving randomness, such as the subsampled approaches, the results may differ slightly)
(note that the corresponding data sets should be under the data/ directory)

run_bibtex.m
run_delicious.m
run_mediamill.m
run_eurlex.m
run_wiki10.m

For example, to reproduce the result of bibtex dataset, just type "run_bibtex" in Matlab.
The titles shown in the output correspond to the following formulations:

leml: the SQ-SQ formulation in FULL approach (equivalent to LEML, also in our proposed framework).
square: the SQ-wSQ formulation in FULL approach (in our proposed framework)
logis: the LR-wSQ formulation in FULL approach (in our proposed framework)
logis+nystrom: the LR-wSQ formulation in FULL approach w/ Nystrom nonlinear features (in our proposed framework)
full logis: the LR-wLR formulation in FULL approach (runs very slowly in practice; included just for comparison)
sub logis: the LR-LR formulation in SUBSAMPLED approach (gives inferior performance in practice; included just for comparison)
sub square: the SQ-SQ formulation in SUBSAMPLED approach (gives inferior performance in practice; included just for comparison)

| Additional Data Preprocessing:
| 
| For the five data sets, we have appended a one (a bias term) at the end of each feature vector.
| (done using data/add_bias.m)
| For the "eurlex" data set, we have normalized each feature vector instance-wise before appending the one.
| (done using data/instance_norm.m)
| 
| These preprocessing steps can be seen in the top few lines of each run_<dataname>.m script.
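In Python terms, these two steps amount to the following sketch (our own illustration; the actual work is done by data/instance_norm.m and data/add_bias.m, and we assume the normalization is instance-wise L2 normalization):

```python
# Sketch of the two preprocessing steps: instance-wise L2 normalization
# (used for "eurlex") followed by appending a constant-1 bias feature.
import numpy as np
import scipy.sparse as sp

X = sp.random(5, 3, density=0.8, format='csr', random_state=0)

# instance-wise (row-wise) L2 normalization
norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
norms[norms == 0] = 1.0               # guard against empty rows
Xn = sp.diags(1.0 / norms) @ X

# append a one at the end of each feature vector (bias term)
ones = sp.csr_matrix(np.ones((Xn.shape[0], 1)))
Xb = sp.hstack([Xn, ones], format='csr')
```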


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! If you only want to reproduce the numbers in our experiments, !!!
!!!     you can skip Part 3 to Part 5 (for grid search only).     !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Multi-Label Problems: Part 3. Train-Validation Split
====================================================

Before applying grid search, we need a validation set to measure the performance of each parameter setting.
To prepare the training-validation set for multi-label problems (this differs from the one-class matrix factorization problems),
you can use the following Matlab script: "data/get_validation.m".

It produces a Matlab structure with a 4:1 train-validation random split, of the same form as described in the note in Part 1,
but with Xt, Yt being the validation set rather than the testing set.
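The split itself is simple; the following Python sketch shows the idea (our own illustration, not the actual data/get_validation.m script):

```python
# Sketch: 4:1 random split of the training instances (rows) into a
# training part and a validation part.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
X = sp.random(100, 50, density=0.1, format='csr', random_state=0)
Y = sp.random(100, 10, density=0.1, format='csr', random_state=1)
Y.data[:] = 1.0

perm = rng.permutation(X.shape[0])
cut = (4 * X.shape[0]) // 5           # 4/5 for training, 1/5 for validation
tr, va = perm[:cut], perm[cut:]

# same four fields as in Part 1, with (Xt, Yt) now the validation set
val = {'X': X[tr], 'Y': Y[tr], 'Xt': X[va], 'Yt': Y[va]}
```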

For example, the following commands create the train-validation split for the data sets used in our experiment.
(note that we have applied some data preprocessing, as mentioned in Part 2)

"""""""""""""""""""""""""""""""""""""""""""""""
addpath('data/')

load 'bibtex.mat'
bib_val = add_bias(get_validation(data));

load 'delicious.mat'
deli_val = add_bias(get_validation(data));

load 'mediamill.mat'
medi_val = add_bias(get_validation(data));

load 'eurlex.mat'
eurl_val = add_bias(get_validation(instance_norm(data)));

load 'wiki10.mat'
wiki_val = add_bias(get_validation(data));

"""""""""""""""""""""""""""""""""""""""""""""""


Multi-Label Problems: Part 4. Grid Search
=========================================

To run grid search on multi-label problems, use the following Matlab scripts in grid_search/.
Before running them, please add data/, grid_search/, miscellaneous/ and Nystrom/ to the Matlab path (addpath).

1) run_leml.m: the SQ-SQ formulation in FULL approach (equivalent to LEML, also in our proposed framework)
2) run_square.m: the SQ-wSQ formulation in FULL approach (in our proposed framework)
3) run_logis.m: the LR-wSQ formulation in FULL approach (in our proposed framework)
4) run_nystrom.m: the LR-wSQ formulation in FULL approach w/ Nystrom nonlinear features (in our proposed framework)
5) run_logis_full.m: the LR-wLR formulation in FULL approach (runs very slow in practice, just for comparison)
6) run_subsquare.m: the SQ-SQ formulation in SUBSAMPLED approach (gives inferior performance in practice, just for comparison)
7) run_sublogis.m: the LR-LR formulation in SUBSAMPLED approach (gives inferior performance in practice, just for comparison)

All of these Matlab functions take two input arguments: "data" and "k".
data: the structure containing X (the feature matrix), Y (the label matrix), and Xt, Yt (the validation set).
k: the number of latent variables.

Note that bib_val, deli_val, ... etc. from Part 3, can be directly fed into the first argument of these Matlab functions.

run_nystrom.m takes an additional argument: "CN".
CN: the number of columns selected for the Nystrom approximation.

For example, we can run grid search on the bibtex data set using the FULL SQ-SQ formulation as follows.
(Note that the "diary" command stores the lengthy output in a txt file.)

"""""""""""""""""""""""""""""""""""""""""""""""
addpath('data/')
addpath('grid_search/')
addpath('miscellaneous/')
addpath('Nystrom/')

% Assuming bib_val is in Workspace, which is obtained from Part 3.
diary 'bibtex_leml150.txt'
run_leml(bib_val, 150) % Using k = 150
diary off

"""""""""""""""""""""""""""""""""""""""""""""""


Multi-Label Problems: Part 5. Cleaner Grid Search Result
========================================================

Since the output of the above grid search functions can be long and complicated
(due to the various performance measures reported for each iteration under every parameter setting),
we have written a short Python script, "grid_search/grid_tool.py", to produce a cleaner result.
The usage is as follows (run it in this directory):

python grid_search/grid_tool.py <measure> <file>

<measure>: any of our supported performance measures, including p@1~p@5 (precision), n@1~n@5 (nDCG), ... etc.
<file>: a file containing the output of the above grid search scripts (stored, e.g., using the Matlab "diary" command).

The output is several lines of the form: <parameters> : <performance> ( <best iteration number> )
You can use the "sort" command under Linux to find the best parameter.
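For instance, picking the best line amounts to sorting on the number after the ":" (shown here as a Python sketch; the sample lines just mimic the output format):

```python
# Pick the parameter with the best performance from grid_tool.py-style
# output lines (sample lines below mimic the format described above).
lines = [
    'log2l = -6 : 26.885 ( 6 )',
    'log2l = 4 : 27.602 ( 15 )',
    'log2l = 6 : 23.361 ( 14 )',
]
# the performance value sits between the ':' and the '('
best = max(lines, key=lambda s: float(s.split(':')[1].split('(')[0]))
print(best)   # prints "log2l = 4 : 27.602 ( 15 )"
```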

For example, to obtain a cleaner result for the grid search on bibtex using the FULL SQ-SQ formulation,
just run the following in the command line (it shows the best p@5 performance for each parameter):

"""""""""""""""""""""""""""""""""""""""""""""""
$ python grid_search/grid_tool.py p@5 bibtex_leml150.txt
log2l = -6 : 26.885 ( 6 )
log2l = -4 : 26.926 ( 6 )
log2l = -2 : 26.434 ( 8 )
log2l = 0 : 26.291 ( 9 )
log2l = 2 : 26.475 ( 12 )
log2l = 4 : 27.602 ( 15 )
log2l = 6 : 23.361 ( 14 )
$

"""""""""""""""""""""""""""""""""""""""""""""""
Note that you have to run the example code in Part 4 first to obtain bibtex_leml150.txt.
Note that the results may not be exactly the same, due to the randomness in the 4:1 train-validation split.


==================================
=         One-class MF           =
==================================

One-class MF: Part 1. Data Preparation
======================================

[If you did not run "python download.py" previously]
To obtain the three data sets used in our experiments ("ml100k", "flixster" and "douban"),
please run "python download.py" under the data/ directory (you need to cd to data/ first).
It will automatically download the three data sets as three .mat files in data/.
(It will also download the data sets used in the Multi-Label experiments.)

If you only want some particular data sets,
please download them from http://www.csie.ntu.edu.tw/~cjlin/papers/ocmf-side/data/
and manually put them under the data/ directory.

Now you can proceed to Part 2.

| How to prepare your own data set?
| 
| The .mat file contains a single Matlab structure named "prob",
| which has three fields: LG (the graph Laplacian matrix), Y (the observed matrix), and Yt (the testing matrix).
| 
| LG is an m x m sparse PSD matrix and Y, Yt are both m x n sparse matrices (with only 1s and 0s),
| where m is the # of users and n is the # of items.
| 
| Thus, by creating a Matlab structure of the same form, your data preparation is done.
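As a sketch, this layout can also be produced from Python with scipy (our own illustration; "myprob.mat" and all sizes are made up, and the Laplacian here is simply D - A for a random symmetric nonnegative graph, which is PSD):

```python
# Build a tiny synthetic one-class MF problem in the layout described
# above and save it as a .mat file containing a structure named "prob".
import numpy as np
import scipy.sparse as sp
import scipy.io

m, n = 60, 40                          # users, items

A = sp.random(m, m, density=0.05, format='csr', random_state=0)
A = A + A.T                            # symmetric user-user similarity graph
# graph Laplacian D - A: rows sum to zero, and it is PSD for
# nonnegative edge weights
LG = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

Y  = sp.random(m, n, density=0.05, format='csr', random_state=1)
Yt = sp.random(m, n, density=0.05, format='csr', random_state=2)
Y.data[:]  = 1.0                       # only 1s and 0s
Yt.data[:] = 1.0

scipy.io.savemat('myprob.mat', {'prob': {'LG': LG, 'Y': Y, 'Yt': Yt}})
```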


One-class MF: Part 2. Experiments Reproduction
==============================================

This part shows how to reproduce the experimental results shown in our paper.

Since these problems require some hyper-parameter tuning, we applied grid search to find the best parameters.
To run the grid search yourself, please refer to Part 3.

In this part, the parameters found by grid search are already plugged in.
Thus, by running the following Matlab scripts, you can reproduce the numbers shown in our paper.
(note that 16 cores are used by default)
(note that the corresponding data sets should be under the data/ directory)

run_ml100k.m
run_flixster.m
run_douban.m

For example, to reproduce the result of ml100k dataset, just type "run_ml100k" in Matlab.
The titles shown in the output correspond to the following formulations:

MF: the SQ-SQ formulation without graph information (in our proposed framework)
MF-square: the SQ-wSQ formulation without graph information (in our proposed framework)
MF-logis: the LR-wSQ formulation without graph information (in our proposed framework)
Graph: the SQ-SQ formulation with graph information (in our proposed framework)
Graph-square: the SQ-wSQ formulation with graph information (in our proposed framework)
Graph-logis: the LR-wSQ formulation with graph information (in our proposed framework)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! If you only want to reproduce the numbers in our experiments, !!!
!!!     you can skip Part 3 to Part 5 (for grid search only).     !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


One-class MF: Part 3. Train-Validation Split
============================================

Before applying grid search, we need a validation set to measure the performance of each parameter setting.
To prepare the training-validation set for one-class matrix factorization problems (this differs from the multi-label problems),
you can use the following Matlab script: "data/get_val_MF.m".

It produces a Matlab structure with a 4:1 train-validation random split on the observed entries,
of the same form as described in the note in this section's Part 1,
but with Yt being the validation matrix rather than the testing matrix.
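Unlike the multi-label case, the split here is over observed entries rather than rows; a Python sketch of the idea (our own illustration, not the actual data/get_val_MF.m script):

```python
# Sketch: 4:1 random split of the observed (nonzero) entries of Y into
# a training matrix and a validation matrix of the same shape.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
Y = sp.random(30, 20, density=0.2, format='coo', random_state=0)
Y.data[:] = 1.0                        # observed entries are 1s

idx = rng.permutation(Y.nnz)
cut = (4 * Y.nnz) // 5                 # 4/5 train, 1/5 validation
tr, va = idx[:cut], idx[cut:]

Ytr = sp.coo_matrix((Y.data[tr], (Y.row[tr], Y.col[tr])), shape=Y.shape)
Yva = sp.coo_matrix((Y.data[va], (Y.row[va], Y.col[va])), shape=Y.shape)
```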

For example, the following commands create the train-validation split for the data sets used in our experiment.

"""""""""""""""""""""""""""""""""""""""""""""""
addpath('data/')

load 'ml100k.mat'
ml100k_val = get_val_MF(prob);

load 'flixster.mat'
flixster_val = get_val_MF(prob);

load 'douban.mat'
douban_val = get_val_MF(prob);

"""""""""""""""""""""""""""""""""""""""""""""""


One-class MF: Part 4. Grid Search
=================================

To run grid search on one-class matrix factorization problems (with and without graph information),
use the following Matlab scripts in grid_search/.
Before running them, please add data/ and grid_search/ to the Matlab path (addpath).

1) run_mf.m: the SQ-SQ formulation without graph information (in our proposed framework)
2) run_mf_square.m: the SQ-wSQ formulation without graph information (in our proposed framework)
3) run_mf_logis.m: the LR-wSQ formulation without graph information (in our proposed framework)
4) run_graph.m: the SQ-SQ formulation with graph information (in our proposed framework)
5) run_graph_square.m: the SQ-wSQ formulation with graph information (in our proposed framework)
6) run_graph_logis.m: the LR-wSQ formulation with graph information (in our proposed framework)

All of these Matlab functions take two input arguments: "prob" and "K".
prob: a structure containing Y (the observed matrix), LG (the graph Laplacian matrix) and Yt (the validation matrix).
K: the number of latent variables.

Note that ml100k_val, flixster_val, ... etc. from Part 3, can be directly fed into the first argument of these Matlab functions.

For example, we can run grid search on the ml100k data set using the SQ-SQ formulation without graph information as follows.
(Note that the "diary" command stores the lengthy output in a txt file.)

"""""""""""""""""""""""""""""""""""""""""""""""
addpath('data/')
addpath('grid_search/')

% Assuming ml100k_val is in Workspace, which is obtained from Part 3.
diary 'ml100k_MF64.txt'
run_mf(ml100k_val, 64) % Using K = 64
diary off

"""""""""""""""""""""""""""""""""""""""""""""""


One-class MF: Part 5. Cleaner Grid Search Result
================================================

To produce a cleaner result, "grid_tool.py" can be used in the same way as described in the Multi-Label section.

For example, to obtain a cleaner result for the grid search on ml100k using the SQ-SQ formulation without graph information,
just run the following in the command line (it shows the best p@5 performance for each parameter):

"""""""""""""""""""""""""""""""""""""""""""""""
$ python grid_search/grid_tool.py p@5 ml100k_MF64.txt
log2l = -6 : 15.125 ( 10 )
log2l = -4 : 15.19 ( 9 )
log2l = -2 : 15.277 ( 9 )
log2l = 0 : 15.647 ( 9 )
log2l = 2 : 20.196 ( 4 )
log2l = 4 : 20.239 ( 15 )
log2l = 6 : 12.622 ( 3 )
$

"""""""""""""""""""""""""""""""""""""""""""""""
Note that you have to run the example code in Part 4 first to create ml100k_MF64.txt.
Note that the results may not be exactly the same, due to the randomness in the 4:1 train-validation split.

