LibShortText is an open source library for short-text classification. Please read the COPYRIGHT file before using LibShortText.
To get started, please read Quick Start first.
LibShortText runs on UNIX systems and requires Python 2.6 or newer. Python 2.7 is recommended for better efficiency.
On Unix systems, type
make
to install the package. For training and test data, every line in the file contains a label and a short text in the following format:
<label><TAB><text>
A TAB character separates <label> and <text>. Both the label and the text may contain space characters. Here are some examples.
Jewelry & Watches<TAB>handcrafted two strand multi color bead necklace
Books<TAB>big bike magazine february 1973
Two sample sets included in this package are ‘train_file’ and ‘test_file’.
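The <label><TAB><text> format above can be read with a few lines of Python. A minimal sketch (the function name and file path are placeholders, not part of LibShortText):

```python
# Parse LibShortText's <label><TAB><text> format.
# Each line holds one instance; split on the FIRST tab only,
# since both the label and the text may contain spaces.
def read_short_text(path):
    instances = []
    with open(path) as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            instances.append((label, text))
    return instances
```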
LibShortText provides a simple training-prediction workflow:
The command ‘text-train.py’ trains on a text data set to obtain a model. For example, the following command generates ‘train_file.model’ for the given train_file.
python text-train.py train_file
[output skipped]
text-predict.py predicts a test file using the trained model. For example, the following command predicts test_file with train_file.model and stores the results in predict_result.
python text-predict.py test_file train_file.model predict_result
Accuracy = 87.1800% (4359/5000)
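The reported accuracy is simply the fraction of correctly predicted instances; here, 4359 correct out of 5000 gives 87.18%. A sketch of the same computation over (true, predicted) label pairs:

```python
# Accuracy = #correct / #total, the quantity printed by text-predict.py.
def accuracy(true_labels, predicted_labels):
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / float(len(true_labels))
```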
Once predict_result is obtained, LibShortText provides several handy utilities to conduct error analysis in the Python interactive shell. Please see Interactive Error Analysis for more details.
text-train.py obtains a model by training either a short-text dataset or a LIBSVM-format data set generated by text2svm.py.
Usage: text-train.py [options] training_file [model]
options | description |
---|---|
-P {0|1|2|3|4|5|6|7|converter_directory} | Preprocessor options. The options include stopword removal, stemming, and bigram. (default 1) If a preprocessor directory is given instead, the training data is assumed to be in LIBSVM format already. The preprocessor is included in the model for testing. |
-G {0|1} | Grid search for the parameter C in linear classifiers. (default 0) |
-F {0|1|2|3} | Feature representation. (default 0) |
-N {0|1} | Instance-wise normalization before training/test. (default 1 to conduct normalization) |
-L {0|1|2|3} | Classifier. (default 0) |
-A extra_svm_file | Append extra LIBSVM-format data. This option can be given multiple times to append more than one extra data set. |
-f | Overwrite the existing model file. |
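The bigram preprocessing mentioned under -P can be illustrated with a small, independent sketch (this is not LibShortText's internal code): the feature set consists of the unigram tokens plus each pair of adjacent tokens.

```python
# Turn a short text into unigram + bigram tokens, similar in spirit
# to the bigram preprocessor option. Tokenization here is a plain
# whitespace split, which is only an approximation.
def unigrams_and_bigrams(text):
    tokens = text.split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams
```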
Examples:
text-train.py -L 3 -F 1 -N 1 raw_text_file model_file
text-train.py -P text2svm_converter -L 1 converted_svm_file
text-predict.py predicts labels for a test dataset with a trained model.
Usage: text-predict.py [options] test_file model output
options | description |
---|---|
-f | Overwrite the existing output file. |
-a {0|1} | Output options. (default 1) |
-A extra_svm_file | Append extra LIBSVM-format data. This option can be given multiple times to append more than one extra data set. |
text2svm.py generates a directory containing the information needed to convert short texts to LIBSVM format. An output file in LIBSVM format is also generated.
Usage: text2svm.py [options] text_src [output]
options | description |
---|---|
-P {0|1|2|3|4|5|6|7} | Preprocessor options. The options include stopword removal, stemming, and bigram. (default 1) |
-A extra_svm_file | Append extra LIBSVM-format data. This option can be given multiple times to append more than one extra data set. |
By default, the outputs are a file “text_src.svm” and a directory “text_src.text_converter”. If “output” is specified, they become “output.svm” and “output.text_converter”.
The following questions and answers demonstrate typical usage.
Although text-train.py has several parameters to tune, we carefully chose the default parameters based on a study of short-text classification [2]. Running text-train.py without parameters generally delivers good classification accuracy. It is equivalent to the following command, in which the default parameters are explicitly specified.
python text-train.py -P 1 -G 0 -F 0 -N 1 -L 0 train_file
Meaning for each parameter:
parameters | description |
---|---|
-P “-stemming 0 -stopword 0 -feature 1” | no stemming, no stopword removal, bigram features |
-G null | no LIBLINEAR parameter selection |
-F 0 | binary feature representation |
-N 1 | each instance is normalized to unit length |
-L “-s 4 -c 1 -B -1” | use Crammer and Singer’s multi-class method, set the parameter C to 1, and no bias term is added |
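The -N 1 option divides each feature vector by its Euclidean norm. A minimal sketch on a dense vector (LibShortText itself works on sparse LIBSVM-format data, so this is only an illustration of the arithmetic):

```python
import math

# Instance-wise normalization (-N 1): scale the feature vector
# to unit Euclidean length before training/prediction.
def normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else vector
```

Note that with binary features (-F 0), an instance with n active features becomes a vector whose nonzero entries are all 1/sqrt(n).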
By default, LIBLINEAR (and text-train.py) sets the parameter C to 1. You can automatically select the best parameter C by using -G 1.
Internally, text-train.py converts data to LIBSVM format and applies LIBLINEAR for training. To reuse the pre-processed data, LibShortText provides another workflow:
The following command generates a LIBSVM-format file train_file.svm and a directory train_file.text_converter containing information for the conversion.
python text2svm.py train_file
[ train_file.text_converter and train_file.svm are generated. ]
We then generate two models using the same LIBSVM-format file.
python text-train.py -P train_file.text_converter -L 3 train_file.svm lr.model
[ A logistic regression model, lr.model, is generated. ]
python text-train.py -P train_file.text_converter -L 2 train_file.svm l2svm.model
[ An L2-loss linear SVM model, l2svm.model, is generated. ]
If the specified model or output file already exists, by default neither text-train.py nor text-predict.py overwrites it. You can force new models/prediction outputs with “-f”.
python text-train.py -f train_file
python text-predict.py -f test_file train_file.model predict_result
By default, some additional information for analysis is stored. If you need only the classification accuracy, you can specify “-a 0” to save disk space. For example,
python text-predict.py -a 0 test_file train_file.model predict_result
For LIBLINEAR, you can pass LIBLINEAR parameters in a double-quoted string after “-L”, prefixed with the special character “@”. For example, to use L2-regularized logistic regression as the classifier, set the parameter C to 0.5, and append a bias term to each instance, type
python text-train.py -L @"-s 3 -c 0.5 -B 1" train_file
To show parameters provided by LIBLINEAR/grid, use
python text-train.py -x liblinear
python text-train.py -x grid
For “grid.py”, to specify the range of C, use ‘-G @”-log2c begin,end,step”’. For example, the following command selects the best C among [2^-2, 2^-1, 2^0, 2^1] in terms of cross-validation rates.
python text-train.py -G @"-log2c -2,1,1" train_file
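The range specification “-log2c begin,end,step” expands to the candidate values 2^begin, 2^(begin+step), ..., 2^end. A sketch of the expansion (a helper of my own, not part of grid.py):

```python
# Expand "-log2c begin,end,step" into candidate C values,
# e.g. (-2, 1, 1) -> [0.25, 0.5, 1.0, 2.0].
def c_grid(begin, end, step):
    values = []
    exponent = begin
    while exponent <= end:
        values.append(2.0 ** exponent)
        exponent += step
    return values
```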
You can use the -A option in the command line mode. For example, if you have two extra svm files extra_train_1 and extra_train_2 in LIBSVM-format, then use:
python text-train.py train_file -A extra_train_1 -A extra_train_2
Note that train_file, extra_train_1, and extra_train_2 must have the same number of instances. Then use the following command to predict:
python text-predict.py test_file -A extra_test_1 -A extra_test_2 train_file.model predict_result
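Because -A files are merged instance by instance, a quick sanity check that all files contain the same number of lines can save a confusing error later (a helper of my own, not part of LibShortText):

```python
# Verify that the main file and every extra LIBSVM-format file
# contain the same number of instances (one instance per line).
def same_instance_count(*paths):
    counts = []
    for path in paths:
        with open(path) as f:
            counts.append(sum(1 for _ in f))
    return len(set(counts)) == 1
```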
We provide interactive tools to analyze prediction results. First, generate a file of prediction results with the commands introduced in the Quick Start section. Note that you must NOT specify ‘-a 0’ for text-predict.py, or the prediction result will not be analyzable.
You then enter Python, import the module, load the prediction results, and create an object of Analyzer by reading a model.
python
>>> from libshorttext.analyzer import *
>>> predict_result = InstanceSet('predict_result')
>>> analyzer = Analyzer('train_file.model')
You can select a subset of test data for analysis using the following options.
options | description |
---|---|
wrong | Select wrongly predicted instances. |
with_labels(labels, target) | If target is 'true', then instances whose true labels are in the set labels are selected. If target is 'predict', those predicted to be in labels are chosen. target can also be 'both' or 'or': 'both' finds the intersection of the 'true' and 'predict' selections, while 'or' finds their union. The default value of target is 'both'. |
sort_by_dec | Sort instances by decision values. |
subset(amount, method) | Get a specific amount of data by the method 'top' or 'random'. The default value of method is 'top'. |
For example, among wrongly predicted instances with labels ‘Books’, ‘Music’, ‘Art’, and ‘Baby’, to get those having the highest 100 decision values, you can use
>>> insts = predict_result.select(wrong, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))
You can run the following operations to see details of the selected instances.
>>> analyzer.info(insts)
Number of instances: 100
Accuracy: 0.0 (0/100)
True labels: "Baby" "Art" "Books" "Music"
Predicted labels: "Baby" "Music" "Books" "Art"
Text source: /home/user/libshorttext-1.0/test_file
Selectors:
-> Select wrongly predicted instances
-> labels: "Books", "Music", "Art", "Baby"
-> Sort by maximum decision values.
-> Select 100 instances in top.
The following command generates a confusion table on the selected instances:
>>> analyzer.gen_confusion_table(insts)
Art Books Music Baby
Art 0 15 4 5
Books 10 0 17 3
Music 10 21 0 3
Baby 1 7 4 0
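gen_confusion_table counts, for each true label (row) and predicted label (column), how many selected instances fall in that cell. The same table can be built from (true, predicted) pairs; a sketch, independent of LibShortText's implementation:

```python
from collections import defaultdict

# Build a confusion table: table[true_label][predicted_label] = count.
def confusion_table(pairs):
    table = defaultdict(lambda: defaultdict(int))
    for true_label, predicted_label in pairs:
        table[true_label][predicted_label] += 1
    return table
```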
To analyze a single short text, you first load it by
>>> insts.load_text()
Then you can print information for each single text in insts.
>>> print(insts[61])
text = avengers assemble 4 panini uk collector s edition nm 2012
true label = Books
predicted label = Music
You can print model weights corresponding to tokens of a short text. The following operation prints weights of the three classes with the highest decision values. (To print weights in all classes, you can change 3 to 0.)
>>> analyzer.analyze_single(insts[61], 3)
Music Books Antiques
edition -5.232e-02 8.869e-01 -1.303e-01
s edition -2.219e-02 1.527e-01 -4.077e-02
nm 7.269e-01 6.048e-02 -1.495e-01
collector -5.253e-02 -5.208e-02 8.804e-02
uk 9.466e-01 -2.089e-01 2.683e-02
collector s -3.174e-02 6.389e-02 9.963e-02
4 -2.011e-01 -2.062e-01 1.526e-01
2012 -1.173e-01 2.663e-01 -1.369e-01
s -5.142e-02 1.485e-01 1.757e-01
**decval** 3.816e-01 3.705e-01 2.842e-02
True label: Books
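The **decval** row is consistent with the default settings (-F 0 binary features, -N 1 unit-length normalization): each class's decision value appears to be the sum of its per-token weights divided by sqrt(n), where n is the number of features shown (here 9), since a unit-normalized binary vector has entries 1/sqrt(n). A sketch reproducing the Books column above (the numbers are copied from the table, and this relation is an observation about this example, not a documented formula):

```python
import math

# Books-column weights from the table above, in row order
# (edition, s edition, nm, collector, uk, collector s, 4, 2012, s).
books_weights = [8.869e-01, 1.527e-01, 6.048e-02, -5.208e-02, -2.089e-01,
                 6.389e-02, -2.062e-01, 2.663e-01, 1.485e-01]

# With 9 binary features normalized to unit length, each feature value
# is 1/sqrt(9), so the decision value is sum(weights) / sqrt(9).
decval = sum(books_weights) / math.sqrt(len(books_weights))
# decval is approximately 3.705e-01, matching the **decval** row.
```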
You can also analyze an arbitrary short text.
>>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
Music Crafts Travel
sealed 4.828e-01 1.050e-03 -5.383e-02
cd 2.872e+00 -1.032e-01 -1.723e-01
cd single 1.663e-01 -5.181e-03 -6.558e-03
single 4.375e-01 -6.953e-02 -9.960e-02
usa 2.247e-01 3.530e-02 2.657e-02
beatles 5.050e-01 -5.710e-02 -6.933e-02
3 cd 1.320e-02 -3.837e-02 -7.793e-20
3 3.057e-02 4.712e-02 1.402e-01
**decval** 1.673e+00 -6.716e-02 -8.299e-02
[1] | H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A Library for Short-text Classification. |
[2] | H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin. Product title classification versus text classification. |
For any questions and comments, please email cjlin@csie.ntu.edu.tw.