Overview of LibShortText

LibShortText is an open source library for short-text classification. Please read the COPYRIGHT file before using LibShortText.

To get started, please read Quick Start first.

Installation and Data Format

LibShortText requires UNIX systems with Python 2.6 or newer versions. The latest version (Python 2.7) is recommended for better efficiency.

On Unix systems, type

make

to install the package. For training and test data, every line in the file contains a label and a short text in the following format:

<label><TAB><text>

A TAB character separates <label> and <text>. Both the label and the text can contain space characters. Here are some examples.

Jewelry & Watches<TAB>handcrafted two strand multi color bead necklace
Books<TAB>big bike magazine february 1973

Two sample sets included in this package are ‘train_file’ and ‘test_file’.
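As a minimal sketch of reading this format, the following splits each line at the first TAB so that spaces inside the label or the text are preserved. The helper name parse_line is hypothetical and not part of LibShortText, and treating the first TAB as the separator is an assumption.

```python
def parse_line(line):
    """Split one '<label><TAB><text>' line into (label, text)."""
    # Assumption: the first TAB on the line separates label from text.
    label, sep, text = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError("missing TAB between label and text: %r" % line)
    return label, text


examples = [
    "Jewelry & Watches\thandcrafted two strand multi color bead necklace\n",
    "Books\tbig bike magazine february 1973\n",
]
pairs = [parse_line(line) for line in examples]
for label, text in pairs:
    print("%s | %s" % (label, text))
```

The same function can be applied to every line of train_file or test_file to inspect the data before training.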

Quick Start

LibShortText provides a simple training-prediction workflow:

digraph workflow1 {
"short texts" -> "model"
[label="text-train.py"];
"model"-> "predictions"
[label="text-predict.py"];
}

The command ‘text-train.py’ trains a model from a set of short texts. For example, the following command generates ‘train_file.model’ from train_file.

python text-train.py train_file
[output skipped]

text-predict.py predicts a test file using the trained model. For example, the following command predicts test_file with train_file.model and stores the results in predict_result.

python text-predict.py test_file train_file.model predict_result
Accuracy = 87.1800% (4359/5000)
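The accuracy line is the fraction of test instances whose predicted label equals the true label. The sketch below reproduces the line above from two label lists. The function name is hypothetical, and the output format is copied from the example rather than taken from the code of text-predict.py.

```python
def accuracy_line(true_labels, predicted_labels):
    """Format accuracy as in the text-predict.py example output."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists differ in length")
    total = len(true_labels)
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return "Accuracy = %.4f%% (%d/%d)" % (100.0 * correct / total, correct, total)


# Synthetic labels that give 4359 correct predictions out of 5000.
true_labels = ["Books"] * 5000
predicted_labels = ["Books"] * 4359 + ["Music"] * 641
line = accuracy_line(true_labels, predicted_labels)
print(line)
```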

Once predict_result is obtained, LibShortText provides several handy utilities to conduct error analysis in the Python interactive shell. Please see Interactive Error Analysis for more details.

Command-line Usage

text-train.py

text-train.py obtains a model by training either a short-text dataset or a LIBSVM-format data set generated by text2svm.py.

Usage: text-train.py [options] training_file [model]

options description
-P {0|1|2|3|4|5|6|7|converter_directory}

Preprocessor options. The options include stopword removal, stemming, and bigram. (default 1)

  • 0: no stopword removal, no stemming, unigram
  • 1: no stopword removal, no stemming, bigram
  • 2: no stopword removal, stemming, unigram
  • 3: no stopword removal, stemming, bigram
  • 4: stopword removal, no stemming, unigram
  • 5: stopword removal, no stemming, bigram
  • 6: stopword removal, stemming, unigram
  • 7: stopword removal, stemming, bigram

If a preprocessor directory is given instead, then it is assumed that the training data is already in LIBSVM format. The preprocessor will be included in the model for test.

-G {0|1}

Grid search for the parameter C in linear classifiers. (default 0)

  • 0: disable grid search (faster)
  • 1: enable grid search (slightly better results)

-F {0|1|2|3}

Feature representation. (default 0)

  • 0: binary feature
  • 1: word count
  • 2: term frequency
  • 3: TF-IDF (term frequency + IDF)

-N {0|1} Instance-wise normalization before training/test. (default 1, which enables normalization)

-L {0|1|2|3}

Classifier. (default 0)

  • 0: support vector classification by Crammer and Singer
  • 1: L1-loss support vector classification
  • 2: L2-loss support vector classification
  • 3: logistic regression

-A extra_svm_file Append extra LIBSVM-format data. This option can be given multiple times to append more than one extra LIBSVM-format data set.
-f Overwrite the existing model file.

Examples:

text-train.py -L 3 -F 1 -N 1 raw_text_file model_file
text-train.py -P text2svm_converter -L 1 converted_svm_file
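The eight numeric -P values listed above follow a regular pattern: read as a three-bit number, the bits switch on stopword removal (4), stemming (2), and bigram (1). The sketch below decodes a value that way. The function name is hypothetical, and this is an observation about the numbering, not a description of how LibShortText parses the option.

```python
def decode_preprocessor_option(option):
    """Decode a numeric -P value (0 to 7) into its three switches."""
    if not 0 <= option <= 7:
        raise ValueError("-P expects an integer from 0 to 7, got %r" % option)
    return {
        "stopword_removal": bool(option & 4),
        "stemming": bool(option & 2),
        "bigram": bool(option & 1),
    }


for option in range(8):
    print(option, decode_preprocessor_option(option))
```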

text-predict.py

text-predict.py predicts labels for a test dataset with a trained model.

Usage: text-predict.py [options] test_file model output

options description
-f Overwrite the existing output file.
-a {0|1}

Output options. (default 1)

  • 0: Store only predicted labels. The information is NOT sufficient for interactive analysis. Use this option if you would like to get only accuracy.
  • 1: More information is stored. The output provides information for interactive analysis, but the size of output can become much larger.

-A extra_svm_file Append extra LIBSVM-format data. This option can be given multiple times to append more than one extra LIBSVM-format data set.

text2svm.py

text2svm.py generates a directory containing needed information for converting short texts to LIBSVM format. An output file in LIBSVM format is also generated.

Usage: text2svm.py [options] text_src [output]

options description
-P {0|1|2|3|4|5|6|7}

Preprocessor options. The options include stopword removal, stemming, and bigram. (default 1)

  • 0: no stopword removal, no stemming, unigram
  • 1: no stopword removal, no stemming, bigram
  • 2: no stopword removal, stemming, unigram
  • 3: no stopword removal, stemming, bigram
  • 4: stopword removal, no stemming, unigram
  • 5: stopword removal, no stemming, bigram
  • 6: stopword removal, stemming, unigram
  • 7: stopword removal, stemming, bigram

-A extra_svm_file Append extra LIBSVM-format data. This option can be given multiple times to append more than one extra LIBSVM-format data set.

Default output will be a file “text_src.svm” and a directory “text_src.text_converter”. If “output” is specified, the output will be “output.svm” and “output.text_converter”.

More Examples about Command-line Usage

We use the following questions/answers to demonstrate some examples.

Q: Given many parameters provided by text-train.py, how to choose the parameters at the first trial?

Although text-train.py has several parameters to tune, we carefully choose default parameters based on a study on short-text classification [2]. Running text-train.py without parameters can deliver good classification accuracy in general. It is equivalent to the following command, in which default parameters are explicitly specified.

python text-train.py -P 1 -G 0 -F 0 -N 1 -L 0 train_file

Meaning for each parameter:

parameters                                 description
-P "-stemming 0 -stopword 0 -feature 1"    no stemming, no stopword removal, bigram features
-G null                                    no LIBLINEAR parameter selection
-F 0                                       binary feature representation
-N 1                                       each instance is normalized to unit length
-L "-s 4 -c 1 -B -1"                       use Crammer and Singer's multi-class method, set the parameter C to 1, and add no bias term
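To make the default representation concrete, here is a minimal sketch of binary unigram and bigram features normalized to unit length. Whitespace tokenization and the function name are assumptions; the sketch also ignores the dictionary that the real preprocessor builds during training, so it illustrates the idea rather than LibShortText's implementation.

```python
import math


def binary_unit_features(text):
    """Map a short text to binary unigram+bigram features of unit length."""
    tokens = text.split()  # assumption: whitespace tokenization
    grams = set(tokens)
    # Add each pair of adjacent words as a bigram feature.
    grams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    if not grams:
        return {}
    # Every present feature has binary value 1; dividing by sqrt(n)
    # gives the feature vector Euclidean length 1.
    value = 1.0 / math.sqrt(len(grams))
    return dict((gram, value) for gram in grams)


features = binary_unit_features("big bike magazine february 1973")
norm = math.sqrt(sum(v * v for v in features.values()))
print(len(features), round(norm, 6))
```

With five words, the text yields five unigrams and four bigrams, so each of the nine features gets the value 1/3.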

Q: How to select the parameter C in LIBLINEAR automatically?

By default, LIBLINEAR (and text-train.py) sets the parameter C to 1. You can automatically select the best parameter C by using -G 1.

Q: How to generate different models using the same training data?

Internally, text-train.py converts data to LIBSVM format and applies LIBLINEAR for training. To reuse the pre-processed data, LibShortText provides another workflow:

digraph workflow2 {
"short text" -> "LIBSVM format data"
[label="text2svm.py"];
"LIBSVM format data"-> "model"
[label="text-train.py"];
"model" -> "result"
[label="text-predict.py"];
}

The following command generates a LIBSVM-format file train_file.svm and a directory train_file.text_converter containing information for the conversion.

python text2svm.py train_file

[ train_file.text_converter and train_file.svm are generated. ]

We then generate two models using the same LIBSVM-format file.

python text-train.py -P train_file.text_converter -L 3 train_file.svm lr.model

[ A logistic regression model, lr.model, is generated. ]

python text-train.py -P train_file.text_converter -L 2 train_file.svm l2svm.model

[ An L2-loss linear SVM model, l2svm.model, is generated. ]
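When several classifiers are compared on the same converted data, the commands above differ only in the -L value and the model name. The following sketch builds those command lines from Python. The helper name is hypothetical, and the snippet only prints the commands instead of running them.

```python
def train_commands(converter_dir, svm_file, models):
    """Build one text-train.py command per (classifier, model_file) pair."""
    commands = []
    for classifier, model_file in models:
        commands.append([
            "python", "text-train.py",
            "-P", converter_dir,     # reuse the existing converter directory
            "-L", str(classifier),   # classifier id as listed for -L
            svm_file, model_file,
        ])
    return commands


commands = train_commands(
    "train_file.text_converter",
    "train_file.svm",
    [(3, "lr.model"), (2, "l2svm.model")],
)
for command in commands:
    print(" ".join(command))
    # To actually train, pass each list to subprocess.check_call(command).
```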

Q: How to overwrite existing models or prediction results?

If the specified model or output file exists, by default, neither text-train.py nor text-predict.py overwrites it. You can generate new models/prediction outputs with “-f”.

python text-train.py -f train_file
python text-predict.py -f test_file train_file.model predict_result

Q: Why is the file of prediction results so large?

By default, some additional information for analysis is stored. If you need to get only classification accuracy, you can specify “-a 0” to save disk space. For example,

python text-predict.py -a 0 test_file train_file.model predict_result

Q: If I am an experienced LIBLINEAR user, how should I specify options for LIBLINEAR and “grid.py”?

For LIBLINEAR, pass the parameters after “-L” as a double-quoted string prefixed with the special character “@”. For example, to use L2-regularized logistic regression as the classifier, set the parameter C to 0.5, and append a bias term to each instance, type

python text-train.py -L @"-s 0 -c 0.5 -B 1" train_file

To show parameters provided by LIBLINEAR/grid, use

python text-train.py -x liblinear
python text-train.py -x grid

For “grid.py”, specify the range of C with ‘-G @"-log2c begin,end,step"’. For example, the following command selects the best C among [2^-2, 2^-1, 2^0, 2^1] in terms of cross-validation accuracy.

python text-train.py -G @"-log2c -2,1,1" train_file
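The sketch below shows two things about this command: the single argument that a POSIX shell hands to text-train.py after removing the double quotes, and the candidate values of C implied by the begin,end,step triple. How text-train.py and grid.py interpret the string internally is not shown here; the parsing code and the function name are illustrative assumptions, limited to integer exponents and a positive step.

```python
import shlex

command = 'python text-train.py -G @"-log2c -2,1,1" train_file'
argv = shlex.split(command)              # tokenizes like a POSIX shell
grid_arg = argv[argv.index("-G") + 1]    # the quotes are gone: '@-log2c -2,1,1'


def candidate_c_values(begin, end, step):
    """Return 2**k for k = begin, begin + step, ... up to and including end."""
    values = []
    exponent = begin
    while exponent <= end:
        values.append(2.0 ** exponent)
        exponent += step
    return values


# Drop the leading '@' and read the three integers after -log2c.
option, log2c_range = grid_arg.lstrip("@").split()
begin, end, step = [int(x) for x in log2c_range.split(",")]
c_values = candidate_c_values(begin, end, step)
print(grid_arg)
print(c_values)  # [0.25, 0.5, 1.0, 2.0]
```

The same tokenization matters when the command is launched from Python with an argument list: the list element is '@-log2c -2,1,1', without the surrounding quotes.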

Q: I already have some LIBSVM-format features. How can I include these features when training the model?

You can use the -A option in the command line mode. For example, if you have two extra svm files extra_train_1 and extra_train_2 in LIBSVM-format, then use:

python text-train.py train_file -A extra_train_1 -A extra_train_2

Note that train_file, extra_train_1, and extra_train_2 must have the same number of instances. Then predict with the corresponding extra files for the test data:

python text-predict.py test_file -A extra_test_1 -A extra_test_2 train_file.model predict_result
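Before training with -A, it can help to confirm that all files contain the same number of instances. The following sketch counts instances as non-empty lines, which is an assumption about the files; the function names are hypothetical. It works on any iterable of lines, including open file objects.

```python
def count_instances(lines):
    """Count non-empty lines (assumption: one instance per non-empty line)."""
    return sum(1 for line in lines if line.strip())


def check_same_size(main_lines, extra_sources):
    """Raise ValueError if any extra source differs in size from the main data."""
    expected = count_instances(main_lines)
    for name, lines in extra_sources:
        found = count_instances(lines)
        if found != expected:
            raise ValueError(
                "%s has %d instances, expected %d" % (name, found, expected))
    return expected


# In-memory stand-ins for train_file and two extra LIBSVM-format files.
main = ["Books\tbig bike magazine february 1973\n", "Music\tbeatles help cd\n"]
extra_1 = ["1 1:0.5\n", "2 3:1\n"]
extra_2 = ["1 2:1\n", "2 2:0.25\n"]
size = check_same_size(main, [("extra_train_1", extra_1), ("extra_train_2", extra_2)])
print(size)
```

With real files, pass open(path) objects in place of the lists.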

Interactive Error Analysis

We provide interactive tools to analyze prediction results. First, you generate a file of prediction results by the commands introduced in section Quick Start. Note that you CANNOT specify ‘-a 0’ to text-predict.py or the prediction result will not be analyzable.

You then enter Python, import the module, load the prediction results, and create an object of Analyzer by reading a model.

python
>>> from libshorttext.analyzer import *
>>> predict_result = InstanceSet('predict_result')
>>> analyzer = Analyzer('train_file.model')

You can select a subset of test data for analysis using the following options.

options description
wrong Select wrongly predicted instances.
with_labels(labels, target) If target is 'true', then instances with true labels in the set labels are selected. If target is 'predict', those predicted to be in labels are chosen. target can also be 'both' or 'or'. 'both' and 'or' select the intersection and the union of 'true' and 'predict', respectively. The default value of target is 'both'.
sort_by_dec Sort instances by decision values.
subset(amount, method) Get a specific amount of data by the method top or random. The default value of method is top.
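The four target values of with_labels can be pictured with a plain-Python predicate. This is an illustrative sketch, not the library's code: it reads 'both' as requiring the true and the predicted label to be in the set, and 'or' as requiring at least one of them. The function name is hypothetical.

```python
def matches(true_label, predicted_label, labels, target="both"):
    """Decide whether one instance passes a with_labels-style filter."""
    in_true = true_label in labels
    in_predict = predicted_label in labels
    if target == "true":
        return in_true
    if target == "predict":
        return in_predict
    if target == "both":
        return in_true and in_predict
    if target == "or":
        return in_true or in_predict
    raise ValueError("unknown target: %r" % target)


labels = set(["Books", "Music", "Art", "Baby"])
print(matches("Books", "Music", labels))            # both labels are in the set
print(matches("Books", "Antiques", labels))         # only the true label is
print(matches("Books", "Antiques", labels, "or"))   # at least one label is
```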

For example, among wrongly predicted instances with labels ‘Books’, ‘Music’, ‘Art’, and ‘Baby’, to get those having the highest 100 decision values, you can use

>>> insts = predict_result.select(wrong, with_labels(['Books', 'Music', 'Art', 'Baby']), sort_by_dec, subset(100))

You can run the following operations to know details of the selected instances.

>>> analyzer.info(insts)
Number of instances: 100
Accuracy: 0.0 (0/100)
True labels: "Baby"  "Art"  "Books"  "Music"
Predicted labels: "Baby"  "Music"  "Books"  "Art"
Text source: /home/user/libshorttext-1.0/test_file
Selectors:
-> Select wronly predicted instances
-> labels: "Books", "Music", "Art", "Baby"
-> Sort by maximum decision values.
-> Select 100 instances in top.

The following command generates a confusion table on the selected instances:

>>> analyzer.gen_confusion_table(insts)
         Art  Books  Music  Baby
Art        0     15      4     5
Books     10      0     17     3
Music     10     21      0     3
Baby       1      7      4     0
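A quick consistency check on the printed table: its entries sum to the 100 selected instances, and the diagonal is zero because only wrongly predicted instances were selected. The sketch below verifies this and shows how such a table can be counted from label pairs. Which of the true and predicted labels indexes the rows is not stated above, so build_table (a hypothetical helper) simply takes (row_label, column_label) pairs.

```python
def build_table(pairs, labels):
    """Count (row_label, column_label) pairs into a square table."""
    index = dict((label, i) for i, label in enumerate(labels))
    table = [[0] * len(labels) for _ in labels]
    for row_label, column_label in pairs:
        table[index[row_label]][index[column_label]] += 1
    return table


labels = ["Art", "Books", "Music", "Baby"]

# The table printed by gen_confusion_table above, row by row.
printed = [
    [0, 15, 4, 5],
    [10, 0, 17, 3],
    [10, 21, 0, 3],
    [1, 7, 4, 0],
]
total = sum(sum(row) for row in printed)
diagonal = sum(printed[i][i] for i in range(len(labels)))
print(total, diagonal)

# A tiny made-up example of counting pairs.
small = build_table([("Art", "Books"), ("Art", "Books"), ("Baby", "Music")], labels)
print(small)
```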

To analyze a single short text, you first load it by

>>> insts.load_text()

Then you can print information for each single text in insts.

>>> print(insts[61])
text = avengers assemble 4 panini uk collector s edition nm 2012
true label = Books
predicted label = Music

You can print model weights corresponding to tokens of a short text. The following operation prints weights of the three classes with the highest decision values. (To print weights in all classes, you can change 3 to 0.)

>>> analyzer.analyze_single(insts[61], 3)
                    Music       Books    Antiques
edition        -5.232e-02   8.869e-01  -1.303e-01
s edition      -2.219e-02   1.527e-01  -4.077e-02
nm              7.269e-01   6.048e-02  -1.495e-01
collector      -5.253e-02  -5.208e-02   8.804e-02
uk              9.466e-01  -2.089e-01   2.683e-02
collector s    -3.174e-02   6.389e-02   9.963e-02
4              -2.011e-01  -2.062e-01   1.526e-01
2012           -1.173e-01   2.663e-01  -1.369e-01
s              -5.142e-02   1.485e-01   1.757e-01
**decval**      3.816e-01   3.705e-01   2.842e-02
True label: Books
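The printed decision values are consistent with binary features normalized to unit length: nine tokens are listed, so each feature has the value 1/sqrt(9) = 1/3, and the decision value of a class is the sum of its weights times that value. This reading is inferred from the numbers above rather than taken from the library's code, and the function name is hypothetical.

```python
import math

# Weights copied from the Music and Books columns of the output above.
music_weights = [-5.232e-02, -2.219e-02, 7.269e-01, -5.253e-02, 9.466e-01,
                 -3.174e-02, -2.011e-01, -1.173e-01, -5.142e-02]
books_weights = [8.869e-01, 1.527e-01, 6.048e-02, -5.208e-02, -2.089e-01,
                 6.389e-02, -2.062e-01, 2.663e-01, 1.485e-01]


def decision_value(weights):
    """Sum of weights times the unit-length feature value 1/sqrt(n)."""
    return sum(weights) / math.sqrt(len(weights))


music_decval = decision_value(music_weights)
books_decval = decision_value(books_weights)
print("%.3e %.3e" % (music_decval, books_decval))
```

The two results agree with the **decval** row (3.816e-01 and 3.705e-01) up to the rounding of the printed weights, which shows how narrowly Music beat the true label Books for this instance.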

You can also analyze an arbitrary short text.

>>> analyzer.analyze_single('beatles help longbox sealed usa 3 cd single', 3)
                  Music      Crafts      Travel
sealed        4.828e-01   1.050e-03  -5.383e-02
cd            2.872e+00  -1.032e-01  -1.723e-01
cd single     1.663e-01  -5.181e-03  -6.558e-03
single        4.375e-01  -6.953e-02  -9.960e-02
usa           2.247e-01   3.530e-02   2.657e-02
beatles       5.050e-01  -5.710e-02  -6.933e-02
3 cd          1.320e-02  -3.837e-02  -7.793e-20
3             3.057e-02   4.712e-02   1.402e-01
**decval**    1.673e+00  -6.716e-02  -8.299e-02

Additional Information

[1] H.-F. Yu, C.-H. Ho, Y.-C. Juan, and C.-J. Lin. LibShortText: A Library for Short-text Classification.
[2] H.-F. Yu, C.-H. Ho, P. Arunachalam, M. Somaiya, and C.-J. Lin. Product title classification versus text classification.

For any questions and comments, please email cjlin@csie.ntu.edu.tw.