Introduction
============

svmprob is a package for multiclass probability outputs based on
libsvm and randomForest.  Details of the methods implemented here are
described in

T.-F. Wu, C.-J. Lin, and R. C. Weng. 
Probability Estimates for Multi-class Classification by Pairwise Coupling.  
September, 2003. A short version appears in NIPS 2003.
http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf

The datasets used in this paper can be downloaded from
http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/data/

./easy_rf.py and ./easy_svm.py are the top-level scripts that
run the experiments. They are the best place to start.

Quick Install 
=============
The installation steps were tested on Ubuntu 9.10 but should generalize
to other Linux/UN*X systems. Run the command marked [unitest] under each
item to check the corresponding installation.

o   numpy package
    apt-get install python-numpy
    [unitest] ./coupler.py 


o   R, rpy, randomForest
    apt-get install python-rpy 
    apt-get install r-base-core
    at the R prompt, type install.packages('randomForest')
    [unitest] ./rndForest.py

o   libsvm, libsvm python interface
    
   -Install using the package manager
    apt-get install libsvm-python

   -Install from source
    Download the standard libsvm package and compile the python interface.
    Add /path_to/libsvm-2.4-or-higher/python/ to the environment variable PYTHONPATH
    (e.g. export PYTHONPATH=$PYTHONPATH:~/libsvm-2.5/python)
    Check that the python interface works by running python/svm_test.py in libsvm

    [unitest] ./svmPlatt.py

o   Some of the basic modules are self-testing. You can run such a module
    directly to check that it is installed properly, or run all the tests
    after installation:
    $ make test
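
As a quick complement to the [unitest] commands above, the following sketch
reports which Python modules can be located. It is not part of svmprob. The
module names are assumptions: "numpy" for python-numpy, "rpy" for python-rpy,
and "svm" for the libsvm python interface (svm.py). The sketch uses Python 3
syntax.

```python
import importlib.util

# Assumed module names; adjust them to match your installation.
REQUIRED_MODULES = ["numpy", "rpy", "svm"]


def check_modules(names):
    """Return {module name: True if the module can be located}."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules(REQUIRED_MODULES).items():
    print("%-8s %s" % (name, "ok" if found else "MISSING"))
```

A module reported as MISSING usually means the package is not installed or,
for a source install of libsvm, that PYTHONPATH does not include its python
directory.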


Usage of svmprob
================

General Note:
All options that begin with a dash (e.g. -c cost) must be specified before
non-dash arguments such as <train_file>. The order of the dash options among
themselves is irrelevant.

The argument of -svm must be surrounded by double quotes so that the SVM
parameters are passed as a single string, e.g. -svm "-c 1000 -g 10"
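
The effect of the double quotes can be illustrated with Python's shlex module,
which splits a command line using POSIX shell quoting rules. This is only a
sketch of what the shell hands to the script; train_file is a placeholder.

```python
import shlex

# Quoted: the SVM parameters arrive as ONE argument following -svm.
quoted = shlex.split('./multiprob.py -svm "-c 1000 -g 10" -cv 5 train_file')
print(quoted)

# Unquoted: -c, 1000, -g and 10 arrive as four separate arguments.
unquoted = shlex.split('./multiprob.py -svm -c 1000 -g 10 -cv 5 train_file')
print(unquoted)
```

In the quoted form the third element of the list is the whole string
"-c 1000 -g 10"; in the unquoted form it is just "-c".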

---------------
[svmprob.py]
Introduction:
    Basic implementation of Platt's binary probability output SVM.
    The input training file must contain exactly two labels, 1 and -1.
    The model parameters can be obtained with a generic SVM grid search.
Synopsis: 
    ./svmprob.py [svm-train parameters][-o outfile][-T test_file] train_file

Example:
    Get probability predictions for heart_scale using decision values
    from 5-fold cross validation.

    $ ./svmprob.py -c 8 -g .00781250 -v 5 -o probout ./heart_scale

Screen Output: 
    Accuracy 0.855556
File Output:'probout' (specified after -o option)
# <probability> <decision value>
    0.961350        1.970946
    0.046899       -1.901695
    0.119076       -1.273093
    0.677209        0.432703
    ... ...
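
Each output line pairs a probability with the decision value it was computed
from. Platt's method maps a decision value f to a probability through the
sigmoid P(y = 1 | f) = 1 / (1 + exp(A*f + B)), where A and B are fitted to
the decision values. The sketch below evaluates that sigmoid. The values of
A and B are hypothetical and chosen only for illustration; they are not the
parameters svmprob.py fits, and this is not the code in svmprob.py.

```python
import math


def platt_probability(decision_value, A, B):
    """Evaluate 1 / (1 + exp(A*f + B)) without overflowing for large |f|."""
    z = A * decision_value + B
    if z >= 0:
        # exp(-z) <= 1 here, so neither term can overflow.
        return math.exp(-z) / (1.0 + math.exp(-z))
    return 1.0 / (1.0 + math.exp(z))


# Hypothetical sigmoid parameters (assumption, for illustration only).
A, B = -1.5, 0.0

# Print in the same "<probability> <decision value>" layout as probout.
for f in (-2.0, 0.0, 2.0):
    print("%12.6f %15.6f" % (platt_probability(f, A, B), f))
```

With a negative A, larger decision values give probabilities closer to 1,
which matches the direction seen in the sample output above.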

[multiprob.py]
Introduction:
    Multiclass probability output based on libsvm and randomForest(optional)
Synopsis:
    ./multiprob.py 
    [-couplers {ht,pkpd,markov,*minpair,vote}]    #couplers separate by comma
    [-base {*svmplatt,rf,svmdeci}]                #binary probability classifier
    [-cv cv_fold][-logfile logfile][-seed rndseed]#general parameters
    [-svm "SVM Parameters"][-c cost][-g gamma]    #SVM related parameters
    [-mtry mtry]                                  #randomForest parameters
    [-o <outfile>][-T <testfile>] <train_file>    #test and training files

	couplers can be multiple, for example -couplers ht,minpair,pkpd
	base must be single, for example -base svmplatt

Example:
    Using a 3-class data split from 'dna.scale':
    $ ./multiprob.py -o out -couplers pkpd,minpair  -T dna.scale.t-1 dna.scale-1
Screen Output:
   Using default couplers ['minpair']
   Classes dispatched: [1.0, 2.0, 3.0]
   ..........................................vote[1.0, 2.0] 
   (many lines of dot)
   SubProb Accuracy pkpd(442/500) 88.400000% MSE 0.055218 Logloss -0.298454
   SubProb Accuracy minpair(442/500) 88.400000% MSE 0.055312 Logloss -0.296751
   Accuracies: [('pkpd', 0.88400000000000001), ('minpair', 0.88400000000000001)]
   LogLoss: [('pkpd', -0.29845368614993906), ('minpair', -0.29675127897858677)]
   Testing MSE 0.055218 Accuracy 88.400000% #Test Accuracy for first coupler

File Output: 'out' (specified after -o option)
   There will be #(couplers specified)*len(testdata) lines in the output
        <coupler>   <prob of class1>    <prob of class2>    <prob of class3>
$ head out
#1      pkpd        0.123326144664      0.867096109690      0.00957774564611
        minpair     0.122957023457      0.861420055514      0.0156229210295
#2      pkpd        0.192141759416      0.783449441711      0.0244087988732
        minpair     0.191512274526      0.785991592476      0.0224961329982
    ... ...
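
The output file can be read back with a few lines of Python. The sketch below
assumes the layout shown above: whitespace-separated fields, a "#<n>" marker
on the first row of each test sample, and one row per coupler. It is a reading
aid, not part of svmprob.

```python
SAMPLE_OUTPUT = """\
#1      pkpd        0.123326144664      0.867096109690      0.00957774564611
        minpair     0.122957023457      0.861420055514      0.0156229210295
#2      pkpd        0.192141759416      0.783449441711      0.0244087988732
        minpair     0.191512274526      0.785991592476      0.0224961329982
"""


def parse_multiprob_output(lines):
    """Return {sample number: {coupler: [prob of class1, class2, ...]}}."""
    results = {}
    sample = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        # A leading "#<n>" field starts the rows of test sample n.
        if fields[0].startswith("#") and fields[0][1:].isdigit():
            sample = int(fields[0][1:])
            fields = fields[1:]
        if sample is None or len(fields) < 2:
            continue
        try:
            probs = [float(x) for x in fields[1:]]
        except ValueError:
            continue  # not a data row
        results.setdefault(sample, {})[fields[0]] = probs
    return results


parsed = parse_multiprob_output(SAMPLE_OUTPUT.splitlines())
for sample in sorted(parsed):
    for coupler, probs in parsed[sample].items():
        # Position of the largest probability, counted from 1.
        best = probs.index(max(probs)) + 1
        print(sample, coupler, "most probable column:", best)
```

To read a real file, pass an open file object instead, for example
parse_multiprob_output(open("out")).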

[easy_svm.py]
Introduction:
    This is a fully automatic script that determines the libsvm model
    parameters (c, gamma) by cross validation and then predicts the test
    data with the best parameters.
Synopsis:
    ./easy_svm.py [-couplers {ht,pkpd,markov,*minpair,vote}][-s rndseed]
    [-pcv platt_cv_fold][-cv full_fold] <train_file> <testing_file>
    This program automatically determines the best (c,g).

Example:
    Train and test on heart_scale using the default coupler and base.
    ./easy_svm.py heart_scale heart_scale
Screen Outputs:
    Cross Validation [...]
    ............
    (omitted)
    Testing [...]
    ...........
    (omitted)
    Testing MSE 0.105901 Accuracy 85.555556%
    Output Result: heart_scale-cv-test
File Output:
    letter.scale-0-cv: lines of (c, g, accuracy, mse, coupler) results during CV
    <cost>  <gamma> <training accuracy> <MSE>   <coupler used>
    5.00    -7.00   0.829630    0.119951    minpair
    -1.00   -7.00   0.829630    0.126644    minpair
    5.00    -1.00   0.759259    0.157541    minpair
    -1.00   -1.00   0.796296    0.144061    minpair
    ...
    letter.scale-0-cv-test: (c, g, accuracy, mse, coupler) result of each test
    <cost>  <gamma> <training accuracy> <MSE>   <coupler used>
    3.00    -7.00   0.855556    0.105901    minpair
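
The CV file is easy to post-process. The sketch below parses rows in the
five-column layout shown above and picks the row with the highest accuracy.
Breaking ties by lower MSE is an assumption made for this example and is not
necessarily the rule easy_svm.py applies.

```python
SAMPLE_CV = """\
5.00    -7.00   0.829630    0.119951    minpair
-1.00   -7.00   0.829630    0.126644    minpair
5.00    -1.00   0.759259    0.157541    minpair
-1.00   -1.00   0.796296    0.144061    minpair
"""


def parse_cv_rows(lines):
    """Parse "<cost> <gamma> <accuracy> <MSE> <coupler>" rows into dicts."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # header, "..." or blank line
        try:
            cost, gamma, accuracy, mse = (float(x) for x in fields[:4])
        except ValueError:
            continue
        rows.append({"cost": cost, "gamma": gamma, "accuracy": accuracy,
                     "mse": mse, "coupler": fields[4]})
    return rows


def best_row(rows):
    """Highest accuracy first; lower MSE breaks ties (assumed rule)."""
    return max(rows, key=lambda r: (r["accuracy"], -r["mse"]))


rows = parse_cv_rows(SAMPLE_CV.splitlines())
best = best_row(rows)
print("best row: cost=%.2f gamma=%.2f accuracy=%.6f mse=%.6f" %
      (best["cost"], best["gamma"], best["accuracy"], best["mse"]))
```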

[easy_rf.py]
Introduction:
    The same as easy_svm.py, but with randomForest as the binary
    classifier.  It selects the best mtry from the following list and
    uses it for prediction:
    [1, sqrt(#attr), #attr/3, #attr/2, #attr]
Synopsis:
    ./easy_rf.py [-couplers {ht,pkpd,markov,*minpair,vote}]
    [-s rndseed][-cv full_fold] <train_file> <testing_file>
Example & Output
    Similar to [easy_svm.py], but uses randomForest as the binary classifier.
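
The mtry candidate list from the introduction above can be computed as in the
following sketch. Truncating to an integer, enforcing a minimum of 1, and
dropping duplicates are assumptions made here; easy_rf.py may round
differently.

```python
import math


def mtry_candidates(n_attr):
    """Candidates [1, sqrt(#attr), #attr/3, #attr/2, #attr] as integers."""
    raw = [1, math.sqrt(n_attr), n_attr / 3.0, n_attr / 2.0, n_attr]
    # Truncate, keep every value at least 1, and remove duplicates.
    return sorted({max(1, int(value)) for value in raw})


print(mtry_candidates(13))
```

For a dataset with 13 attributes this gives [1, 3, 4, 6, 13].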

[rfcv.py]
Synopsis:
    ./rfcv.py [-logfile logfile][-cv nr_fold] <train_file> <test_file>

Example:
    ./rfcv.py heart_scale heart_scale
Screen Output:
#five fold cross-validation, mtry = 1 at first attempt
#'*' means incorrect prediction and '.' means correct
    0..**.*....**.*................**.....*..*...*.**......
    1....*.....*..*.*......*.......*..*...*.....*...*...*..
    2....*.............................*.*.*......*......**
    3.......*.......*....*....*............*.........*.....
    4..................*.............*...*.....*...*.*.....
    CV:mtry = 1, error = 43
    #After CV is done, the best mtry is used for testing and for the output result
    .....
    (omitted)
    Testing pure-RF MSE 0.124311 Accuracy 94.444444% mtry=1
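
The per-fold progress lines can be tallied by counting '*' characters. The
sketch below does this for the five lines shown above, assuming that the
leading digits on each line are the fold index. It only reads the screen
output and is not part of rfcv.py.

```python
FOLD_LINES = [
    "0..**.*....**.*................**.....*..*...*.**......",
    "1....*.....*..*.*......*.......*..*...*.....*...*...*..",
    "2....*.............................*.*.*......*......**",
    "3.......*.......*....*....*............*.........*.....",
    "4..................*.............*...*.....*...*.*.....",
]


def count_errors(line):
    """Count '*' (incorrect predictions) after the leading fold index."""
    marks = line.strip().lstrip("0123456789")
    return marks.count("*")


per_fold = [count_errors(line) for line in FOLD_LINES]
print("errors per fold:", per_fold)
print("total errors:", sum(per_fold))
```

The total agrees with the "error = 43" figure reported for mtry = 1.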


Framework of svmprob
====================
A. common routines encapsulated in a single file
+------------------+
|  Common library  |  svmprob.py
+------------------+

B. multiclass probability output by pairwise coupling
(depend on svmprob.py)
+------------------------------------+
|  HT PKPD MARKOV MINPAIR VOTE       |          coupler.py
+------------------------------------+
|           | svmPlatt | svmDeci     |  rndForest.py | svmPlatt.py
| rndForest +------------------------+               |
|           | libsvm with decival    |               | svm.py
+-----------+------------------------+
|   Integrated Trainer & Tester      |  multiprob.py
+------------------------------------+
|  automatic tuning and test script  |  easy_rf.py  easy_svm.py
+------------------------------------+
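
To make the coupler layer concrete, here is a minimal sketch of two of the
coupling rules named above, following the formulas in the paper cited in the
Introduction. It is not the code in coupler.py. The input r is a k-by-k
matrix of pairwise probabilities, where r[i][j] estimates P(class i | class i
or j) and must be greater than 0 for i != j; the diagonal is ignored.

```python
def pkpd(r):
    """PKPD: p_i = 1 / (sum_{j != i} 1/r_ij - (k - 2)), then normalized."""
    k = len(r)
    raw = []
    for i in range(k):
        inverse_sum = sum(1.0 / r[i][j] for j in range(k) if j != i)
        raw.append(1.0 / (inverse_sum - (k - 2)))
    total = sum(raw)
    return [p / total for p in raw]


def vote(r):
    """Voting: p_i = 2 * (number of j with r_ij > r_ji) / (k * (k - 1))."""
    k = len(r)
    wins = [sum(1 for j in range(k) if j != i and r[i][j] > r[j][i])
            for i in range(k)]
    return [2.0 * w / (k * (k - 1)) for w in wins]


# Build a consistent pairwise matrix from known class probabilities.
true_p = [0.5, 0.3, 0.2]
k = len(true_p)
r = [[true_p[i] / (true_p[i] + true_p[j]) if i != j else 0.0
      for j in range(k)] for i in range(k)]

print("pkpd:", pkpd(r))
print("vote:", vote(r))
```

When the pairwise estimates are exactly consistent, as in this example, PKPD
recovers the original probabilities, while voting only reflects the number of
pairwise wins.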

C. python interface bridge to original randomForest
(this depends on svmprob.py and rndForest.py)
+-----------+
|  rfcv.py  |
+-----------+ 
