
# piaip's Using (lib)SVM Tutorial

piaip at csie dot ntu dot edu dot tw,
Hung-Te Lin
Fri Apr 18 15:04:53 CST 2003
\$Id: svm_tutorial.html,v 1.13 2007/10/02 05:51:55 piaip Exp piaip \$ Original author: Hung-Te Lin (林弘德). Please keep the original source attribution when reproducing.

## Why this tutorial is here

kcwu, biboshen, puffer, somi

## SVM: What is it and what can it do for me?

SVM (Support Vector Machine) has roots similar to those of neural networks, but today it is most widely used for classification. That is, if I have sets of things that have already been classified (but you know nothing about HOW I CLASSIFIED THEM, i.e., the rules used for classification are unknown), then when new data arrives, SVM can PREDICT which set it should belong to.

To get more familiar with SVM, you may refer to the slides cjlin used in his Data Mining course: pdf or ps.
I'm going to try to explain and use libSVM without those slides.

## How do I get SVM?

### Download libsvm

The contents of the .zip and the .tar.gz are the same; pick whichever suits your OS. Windows users usually prefer .zip because they have WinZIP (though I always use WinRAR instead), while UNIX users mostly prefer .tar.gz.

### Build libsvm

Windows users may rebuild from source if they want, but there are already prebuilt binaries in the archive: check the "windows" subdirectory and you should find svmtrain.exe, svmscale.exe, svmpredict.exe, and svmtoy.exe.

## Using SVM

libsvm has lots of functions. This tutorial only explains the easier parts (mostly classification with the default model).

### The programs

svmtrain
Train with your data. Running SVM is jokingly referred to as 'driving trains' by its non-native English speaking authors because of this program's name. svmtrain accepts input in a specific format (explained below) and generates a 'Model' file. You may think of a 'Model' as a storage format for the internal data of SVM: svmpredict needs a model in order to predict and cannot consume the raw data directly. This is reasonable after some thought: training is a time-consuming process, so we train once and store the result, and the 'predict' operation then only has to load that stored data, which is much faster.
svmpredict

svmscale
Rescale data. The range of the original data may be too large or too small, so svmscale can rescale it to a proper range first, which makes training and predicting faster.

### File Format

```
[label] [index1]:[value1] [index2]:[value2] ...
[label] [index1]:[value1] [index2]:[value2] ...
.
.
```

+1 1:0.708 2:1 3:1 4:-0.320 5:-0.105 6:-1

label

index

value
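To make the format concrete, here is a minimal sketch in Python that splits one line of this format into its label and its index/value pairs. The function name `parse_line` is made up for illustration and is not part of libsvm; the sketch assumes every field after the label has the form `index:value`.

```python
def parse_line(line):
    """Split one data line into (label, {index: value}).

    Hypothetical helper for illustration only; not part of libsvm.
    """
    fields = line.split()
    label = float(fields[0])  # "+1" parses to 1.0
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_line("+1 1:0.708 2:1 3:1 4:-0.320 5:-0.105 6:-1")
print(label)     # the class label of this record
print(features)  # attribute index -> attribute value
```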

## To Run libsvm

1. Prepare the data in the specified format, and svmscale it if necessary.
2. Train on the data with svmtrain to create a model.
3. For new input data, use svmpredict to predict its class.
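The three steps above can be sketched as a list of command lines. This Python snippet only builds the commands and does not run anything. The file names, the `.scale` suffix, and the helper name `pipeline_commands` are assumptions for illustration; the program names follow the `./svm-train` style used in the examples in this tutorial, and the model name follows the `[training_set_file].model` default.

```python
def pipeline_commands(train_file, test_file):
    """Return the scale / train / predict commands as argument lists.

    Hypothetical helper: file naming here is an assumption, not a libsvm rule.
    The test file is assumed to be scaled already with the same range.
    """
    scaled = train_file + ".scale"   # assumed name for the scaled training data
    model = scaled + ".model"        # default model name is [training_set_file].model
    output = test_file + ".out"      # assumed name for the prediction output
    return [
        # svm-scale prints to stdout, so its output must be redirected into `scaled`
        ["./svm-scale", "-l", "-1", "-u", "1", train_file],
        ["./svm-train", scaled, model],
        ["./svm-predict", test_file, model, output],
    ]


for command in pipeline_commands("train.txt", "test.txt"):
    print(" ".join(command))
```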

### svmtrain

The syntax of svmtrain is basically:

svmtrain [options] training_set_file [model_file]

The format of training_set_file is described above. If model_file is not specified, it will be [training_set_file].model by default. Options can be omitted at first.

```
./svm-train heart_scale
optimization finished, #iter = 219
nu = 0.431030
obj = -100.877286, rho = 0.424632
nSV = 132, nBSV = 107
Total nSV = 132
```

### svmpredict

The syntax of svmpredict is:

svmpredict test_file model_file output_file

test_file is the data that we are going to 'predict'. Its format is the same as that of training_set_file, which we fed as input to svmtrain. After predicting, svm-predict compares each predicted label with the label written in test_file. That means test_file holds the real (correct) classification, and by comparing it with our predicted result we can tell whether each prediction was correct.

```
./svm-predict heart_scale heart_scale.model heart_scale.out
Accuracy = 86.6667% (234/270) (classification)
Mean squared error = 0.533333 (regression)
Squared correlation coefficient = 0.532639 (regression)
```

As you can see, we fed the original input back in to 'predict', and the 'Accuracy = 86.6667%' on the first line is the accuracy of the prediction. If the input carries no real labels, the result is a genuine prediction.
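The accuracy figure is simply the share of records whose predicted label equals the label in test_file. A minimal Python sketch of that comparison follows; the function name `accuracy` and the two label lists are made up for illustration, with the lists constructed only so that 234 of 270 entries agree, as in the output above.

```python
def accuracy(predicted, actual):
    """Return (correct, total, percent) for two equal-length label lists."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    total = len(actual)
    return correct, total, 100.0 * correct / total


# Synthetic labels, built only to reproduce the 234/270 counts shown above.
actual = [1] * 270
predicted = [1] * 234 + [-1] * 36

correct, total, percent = accuracy(predicted, actual)
print("Accuracy = %.4f%% (%d/%d)" % (percent, correct, total))
# prints: Accuracy = 86.6667% (234/270)
```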

## Advanced Topics

### Scaling

svm-scale is not easy to use right now, but it is important. Proper scaling helps both the choice of arguments (described below) and the speed of solving the SVM.
svmscale rescales every attribute to the range specified by -l and -u, usually [0,1] or [-1,1]. The output goes to stdout.
Please keep in mind that testing data and training data MUST BE SCALED WITH THE SAME RANGE. Don't forget to scale your testing data before you predict.
We can't specify the testing and training data files together and scale them in one command, which is why svm-scale is not so easy to use right now.
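To illustrate why the same range matters, here is a minimal Python sketch of per-attribute linear scaling: the minimum and maximum of each attribute are learned from the training data only and then reused for the testing data. The function names are made up, and this is an illustration of the idea rather than svm-scale's actual code. Leaving a constant attribute unchanged is an arbitrary choice made here.

```python
def fit_ranges(rows):
    """Record the (min, max) of each attribute index seen in the training rows."""
    ranges = {}
    for row in rows:
        for index, value in row.items():
            lo, hi = ranges.get(index, (value, value))
            ranges[index] = (min(lo, value), max(hi, value))
    return ranges


def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Map each value linearly so the training min/max land on lower/upper."""
    scaled = {}
    for index, value in row.items():
        lo, hi = ranges[index]
        if hi == lo:
            scaled[index] = value  # constant attribute: left as-is (arbitrary choice)
        else:
            scaled[index] = lower + (upper - lower) * (value - lo) / (hi - lo)
    return scaled


train = [{1: 0.0, 2: 10.0}, {1: 4.0, 2: 30.0}]
ranges = fit_ranges(train)                       # learned from training data only
print(scale_row({1: 2.0, 2: 30.0}, ranges))      # a training-like row
print(scale_row({1: 6.0, 2: 20.0}, ranges))      # a test row reusing the same ranges
```

Note that a test value outside the training range (such as 6.0 for attribute 1 here) ends up outside [-1,1]; that is expected when the training ranges are reused.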

### Arguments

We can pass some arguments to svm-train when training (running svm-train without any input file or arguments makes it print its syntax help and the complete list of arguments). These arguments correspond to parameters in the original SVM equations, so they affect the accuracy of prediction.
Let's use c=10 as an example:
`./svm-train -c 10 heart_scale`
If you predict again now, the accuracy will be 92.2% (249/270).

#### Cross Validation

Most people use SVM by following this workflow:
1. Prepare lots of pre-classified (correct) data
2. Split them into several training sets randomly.
3. Train with some arguments and predict other sets of data to calculate the accuracy.
4. Change the arguments and repeat until we get good accuracy.

While experimenting with the arguments, we can use svmtrain's built-in support for validation:
-v n: n-fold cross validation
n is the number of sets to split your input data into. Specifying n=3 splits the data into 3 sets: first train the model with sets 1 and 2 and predict set 3 to get the accuracy, then train with sets 2 and 3 and predict set 1, and finally train with sets 1 and 3 and predict set 2. Other values of n work the same way.
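The splitting described above can be sketched in a few lines of Python. This is an illustration of the n-fold idea, not libsvm's internal implementation; the function name `n_fold_splits` and the fixed `seed` are assumptions made so the example is repeatable.

```python
import random


def n_fold_splits(n_items, n_folds, seed=0):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # split the records randomly
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # train on every fold except the i-th, which is held out for prediction
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


for train, test in n_fold_splits(9, 3):
    print("train on", sorted(train), "predict", sorted(test))
```

Each record is held out exactly once, so averaging the per-fold accuracy gives an estimate that uses all of the data.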

#### What arguments rules?

cost is 1 by default, and gamma defaults to 1/k, where k is the number of attributes in the input data. Then how do we know what values to choose as arguments?

T R Y
Yes. Just by trial and error.

When experimenting with arguments, the values are usually increased and decreased exponentially, i.e., as 2^n (2 to the power n).
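As a small illustration of such an exponential search, the following Python sketch lists candidate values of the form 2^n and pairs them up for cost and gamma. The exponent ranges are arbitrary choices for the example, not recommended settings, and the helper name `exponential_grid` is made up.

```python
from itertools import product


def exponential_grid(low_exp, high_exp, step=1):
    """Return [2**low_exp, ..., 2**high_exp], stepping the exponent by `step`."""
    return [2.0 ** n for n in range(low_exp, high_exp + 1, step)]


costs = exponential_grid(-2, 4)    # exponent range chosen arbitrarily
gammas = exponential_grid(-3, 1)   # exponent range chosen arbitrarily
print(costs)   # [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
print(gammas)  # [0.125, 0.25, 0.5, 1.0, 2.0]

# Every (cost, gamma) pair is one candidate to train and cross-validate.
candidates = list(product(costs, gammas))
print(len(candidates), "combinations to try")
```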

### Regression

The other important issue is "Regression".

To explain briefly, we only used SVM to do classification in this tutorial. The labels we used were always discrete (i.e., each one is a known, fixed value). "Regression" in this context means predicting labels with continuous (or unknown) values. You can think of classification as prediction with only binary outcomes, and regression as prediction that outputs real (floating point) numbers.

Thus to predict lottery numbers (since they are always fixed numbers) you should use classification, and to predict the stock market you need regression.

The labels must also be scaled when you use regression, by `svm-scale -y lower upper`

However grid.py does not support regression, and cross validation sometimes does not work well with regression.

Regression is interesting but also advanced. Please refer to other documents for details.

## Copyright

#### All rights reserved by Hung-Te Lin (林弘德, piaip), Website: piaip at ntu csie, 2003.

All HTML/text typed within VIM on Solaris.
Style sheet from W3C Core StyleSheets.