High-level Training and Test Tool — classifier

The classifier module is a high-level interface for training short-text data. Its members include the TextModel class and several utility functions. A TextModel is obtained in training and then used in prediction.

The standard way to get a TextModel instance is via the function train_text() or train_converted_text(), which train text data (refer to Installation and Data Format) or LIBSVM-format data, respectively.

>>> from libshorttext.classifier import *
>>> # train a model and save it to a file
>>> m, svm_file = train_text('train_file')
>>> # save the model to a file
>>> m.save('model_path')

After obtaining a TextModel, users can use predict_text() or predict_single_text() to predict the label of a new short text.

>>> from libshorttext.classifier import *
>>> # load a model from a file
>>> m = TextModel('model_path')
>>> # predict a sentence
>>> result = predict_single_text('This is a sentence.', m) 

Another class in module classifier is PredictionResult, which is a wrapper of prediction results. Both predict_text() and predict_single_text() return a PredictionResult object.

classifier does not access the low-level LIBLINEAR train and predict utilities directly. All jobs are passed to a submodule called learner, a middle-level classifier that communicates between classifier and LIBLINEAR. Users can also use the learner module directly, without classifier, for more complicated usage.

Utility Functions

Utility functions provide the basic training and test procedures. The following example shows how to use these utility functions.

>>> from libshorttext.classifier import *
>>> # training
>>> model, svm_file = train_text('train_file')
>>> # test
>>> results = predict_text('test_file', model)
>>> # save the predicted results
>>> results.save('result_path')
libshorttext.classifier.train_text(text_src, svm_file=None, converter_arguments='', grid_arguments='0', feature_arguments='', train_arguments='', extra_svm_files=[])

Return a tuple of TextModel and str instances, where the str is the file name of the preprocessed LIBSVM-format data.

text_src is the training data path. The data is in text format.

svm_file is the path where the LIBSVM-format data will be generated. If this value is None or omitted, the data is generated in the working directory with the same file name as text_src plus the extension .svm. The resulting path is returned as the second element of the tuple.
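
The naming rule can be sketched in plain Python; default_svm_path is a hypothetical helper written for illustration, not a function of the library.

```python
import os

def default_svm_path(text_src, svm_file=None):
    # Mimic the documented rule: when svm_file is None, the generated
    # file lives in the working directory and is named after text_src
    # with the extension .svm appended.
    if svm_file is not None:
        return svm_file
    return os.path.basename(text_src) + '.svm'

print(default_svm_path('data/train_file'))        # train_file.svm
print(default_svm_path('train_file', 'out.svm'))  # out.svm
```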

converter_arguments contains the arguments for the preprocessor. Refer to libshorttext.converter.convert_text().

grid_arguments contains the arguments passed to the grid search tool. Please refer to the LIBSVM document for the usage. If grid_arguments is set to any value other than '0' or 0, this function applies a grid search first and uses the best parameter C in the training phase. Refer to the LIBLINEAR document for more information on the parameter C.

feature_arguments contains the arguments for feature processing. Refer to Parameters of the LearnerModule.

train_arguments is the LIBLINEAR arguments for training. Refer to LIBLINEAR documents. Note that the default solver is '-s 4'.

extra_svm_files is a list of user-defined LIBSVM-format file paths to be included during the training. Each file should have the same number of instances as text_src.

Note

If grid_arguments is not '0' and the '-c C' option is also given in train_arguments, a ValueError will be raised.

Refer to train_converted_text() for the comparison between train_text() and train_converted_text().

libshorttext.classifier.train_converted_text(svm_file, text_converter, grid_arguments='0', feature_arguments='', train_arguments='')

Return a TextModel generated from the given LIBSVM-format data. svm_file is the path of the LIBSVM-format data. The second argument, text_converter, is a libshorttext.converter.Text2svmConverter instance, which was previously used to convert the texts into the LIBSVM-format data svm_file.

Refer to train_text() for details of arguments grid_arguments, feature_arguments, and train_arguments.

train_converted_text() assumes that svm_file is generated from the text_converter, so text_converter cannot be omitted. If users only need to train LIBSVM-format data and do not know the preprocessing procedure, the libshorttext.classifier.learner module should be used. Refer to The Middle-level Classification Module — learner.

Note that train_text() is implemented by calling this function. The following two code segments are essentially equivalent.

>>> from libshorttext.classifier import *
>>> 
>>> m, svm_file = train_text('train_file', 'svm_file')

The above code can be replaced by the following.

>>> from libshorttext.classifier import *
>>> from libshorttext.converter import *
>>> 
>>> converter = Text2svmConverter()
>>> svm_file = 'svm_file'
>>> convert_text('train_file', converter, svm_file)
>>> m = train_converted_text(svm_file, converter)

The latter one is useful when users want to modify converter before training. For example, they can replace the default tokenizer of libshorttext.converter.Text2svmConverter with a user-defined function before calling train_converted_text(). Refer to Customized Preprocessing for more details on extending libshorttext.converter.Text2svmConverter.

libshorttext.classifier.predict_text(text_src, text_model, svm_file=None, predict_arguments='', extra_svm_files=[])

Return an analyzable PredictionResult instance. (Refer to PredictionResult.) The arguments text_src and svm_file are similar to those of train_text(), but predict_text() returns a PredictionResult instance instead of a tuple; the path of the generated LIBSVM-format data is stored in the result as a member. (Refer to PredictionResult.svm_file.) The size of extra_svm_files should be zero or the same as the number of extra SVM files used in training. Extra SVM files are in LIBSVM format and have the same number of instances as text_src.

Example:

>>> from libshorttext.classifier import *
>>> textModel = TextModel('model_path')
>>> predictionResult = predict_text('train_file', textModel)
>>> print(predictionResult.predicted_y)
['Label 1', 'Label 2', 'Label 1']

Similar to train_arguments in train_text(), predict_arguments contains the arguments for LIBLINEAR's predict. Refer to the LIBLINEAR documents for more details.

Although there are train_text() and train_converted_text() for training, there is no predict_converted_text(). Use libshorttext.classifier.learner.predict() instead for converted data.

libshorttext.classifier.predict_single_text(text, text_model, predict_arguments='', extra_svm_feats=[])

Return an unanalyzable PredictionResult instance. (Refer to PredictionResult.)

text is a short text in str. text can also be a LIBSVM list or dictionary. Refer to LIBSVM python interface document for more details.
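
As an illustration of the accepted sparse format, the following sketch expands a LIBSVM-style dict, which maps 1-based feature indices to values, into a dense list. dict_to_list is a hypothetical helper written for illustration, not part of the library.

```python
def dict_to_list(x, dim):
    # Expand a sparse {index: value} dict with 1-based feature
    # indices (as in the LIBLINEAR python interface) into a dense list.
    dense = [0.0] * dim
    for j, v in x.items():
        dense[j - 1] = v
    return dense

sparse = {1: 2.0, 3: 1.0}       # features 1 and 3 are non-zero
print(dict_to_list(sparse, 4))  # [2.0, 0.0, 1.0, 0.0]
```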

extra_svm_feats is a list of extra features. Its size should be zero or the same as the number of extra feature sets used in training. Each extra feature set is a dict.

Note

extra_svm_feats are ignored if text is already a LIBSVM list or dictionary.

Note

In this version, there is no predict option and argument predict_arguments will be ignored.

Note

This function is designed to analyze the result of some specific short texts. It has a severe efficiency issue. If many instances need to be predicted, they should be stored in a file and predicted by predict_text() or libshorttext.classifier.learner.predict().

Classifier Model — TextModel

class libshorttext.classifier.TextModel(arg1=None, arg2=None)

Bases: object

TextModel is a high-level text model. It consists of a libshorttext.converter.Text2svmConverter instance and a learner.LearnerModel instance. TextModel is portable and can be saved to or loaded from a directory. It is mainly used to predict short texts by predict_text() and predict_single_text().

There are two usages of the constructor. For the first one, users can construct a TextModel instance from an existing model directory by specifying the directory path:

>>> m = TextModel('model_path')

For the other one, users can generate a new TextModel object from a libshorttext.converter.Text2svmConverter instance and a learner.LearnerModel instance.

>>> m = TextModel(text2svmConverter, learnerModel)

In the second usage, both text2svmConverter and learnerModel can be None or ignored. Therefore, users can create an empty TextModel:

>>> m = TextModel()

TextModel generates an id for each model. The id will be the same if users load the whole model by the first usage: TextModel('model_path'). However, TextModel creates a new id if the second usage – TextModel(text2svmConverter, learnerModel) – is used. Therefore, even with the same text2svmConverter and learnerModel, the created TextModel instances are different. Analysis module libshorttext.analyzer uses the id to check the consistency of the model and the prediction results. A different id may introduce some warning messages, so to reload an existing model, users should use the first usage instead of creating a new model using the same parameters.

TextModel can also be generated by train_text() or train_converted_text(). Please refer to train_text() and train_converted_text() for the usage.

get_labels()

Return a list of labels in the model, which should be the labels in the training data.

get_weight(xi, labels=None, extra_svm_feats=[])

Return the weights of the model. It only returns the weights corresponding to the features of xi and the specified labels.

xi is a str of text. It can also be a LIBSVM python interface instance. Refer to LIBSVM python interface document for more details.

labels is a list of str representing the labels. If labels is None, all labels in the model are considered.

The returned value is a triple: features, weights, and labels. features is a list of features extracted from xi. labels is a list of labels. If the input argument labels is given, the input labels will be returned. weights is a list of lists. The length of weights is the number of features, and the length of each list in weights is the number of labels. The ordering of weights depends on the ordering of features and labels. See the following example.

>>> features, weights, labels = m.get_weight('a sentence', ['label 1', 'label 2'])

If the returned features and labels are:

>>> print(features, labels)
['sentence', 'a'] ['label 1', 'label 2']

then the weight of feature 'a' for 'label 1' can be obtained as follows.

>>> weights_j = weights[1] # the weights of 'a'
>>> weights_j_k = weights_j[0] # the weight of 'label 1'

If extra features were included when training the model, extra_svm_feats can be given. extra_svm_feats is a list of lists. Its size should be zero or the same as the number of extra SVM files used in the training phase.
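
The indexing convention can be illustrated with plain lists (the numbers below are made up for illustration, not actual model weights):

```python
features = ['sentence', 'a']
labels = ['label 1', 'label 2']
# weights[j][k] is the weight of features[j] for labels[k]
weights = [[0.5, -0.2],   # weights of 'sentence'
           [0.1,  0.3]]   # weights of 'a'

j = features.index('a')      # 1
k = labels.index('label 1')  # 0
print(weights[j][k])         # 0.1
```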

load(model_name)

Load the contents from a TextModel directory. The following two code segments are equivalent.

>>> m = TextModel('model_path')
>>> m = TextModel()
>>> m.load('model_path')
save(model_name, force=False)

Save the model to a directory. If force is set to True, an existing directory will be overwritten; otherwise, an IOError will be raised.

Results of Prediction — PredictionResult

class libshorttext.classifier.PredictionResult(text_src=None, model_id=None, true_y=None, predicted_y=None, decvals=None, svm_file=None, labels=None, extra_svm_files=[])

Bases: object

PredictionResult is used to get predicted results generated by predict_text() or predict_single_text(). Note that there are two types of PredictionResult instances. One can be used for subsequent analysis (analyzable), and the other is not (unanalyzable). predict_text() returns an analyzable result, while predict_single_text() returns an unanalyzable result.

analyzable()

Return True if the result is analyzable; otherwise, return False.

decvals = None

Decision values of the results. For analyzable results, decvals is from learner.predict(). Refer to learner.predict() for more details. For unanalyzable results, decvals is a list of decision values, sorted according to the labels.

extra_svm_files = None

A list of extra feature file paths. The list is empty for unanalyzable results.

get_accuracy()

Return the accuracy of the prediction results. This method should be called only if the results are analyzable.

labels = None

A list of labels. If the result is analyzable, it is a list of labels in the model.

load(file_name)

Load the result from a file.

PredictionResult cannot be initialized by specifying the result file name. To create a PredictionResult instance from a file, use

>>> p = PredictionResult()
>>> p.load('predict_result')
model_id = None

The model ID. The value is None for unanalyzable results.

predicted_y = None

The predicted labels. It should be a list. However, it may also be a float value if it is returned by predict_single_text().

save(file_name, analyzable=False, fmt='.16g')

Save the prediction results to a file.

analyzable indicates whether the output result is analyzable. If the result is not analyzable, analyzable cannot be set to True. If an analyzable result is not saved in the analyzable mode, it will not be analyzable after being reloaded.

>>> print(p.analyzable())
True
>>> p.save('predict_result.not_analyzable', False)
>>> p.load('predict_result.not_analyzable')
>>> print(p.analyzable())
False

fmt is the output format for floating-point numbers. Fewer digits may result in better readability and a smaller result file.

svm_file = None

Path of the LIBSVM-format test data. The value is None for unanalyzable results.

text_src = None

The location of the text file of the test data. The value is None for unanalyzable results.

true_y = None

The true labels. For analyzable results, it should be a list, while for unanalyzable results, the value is None.

The Middle-level Classification Module — learner

The middle-level classifier learner is used to train or predict LIBSVM-format data. This module extends the LIBLINEAR python interface with additional utilities such as instance-wise normalization, tf-idf, and binary feature representation; otherwise it behaves like the standard LIBLINEAR python interface. We call it a middle-level classifier because it provides an interface between libshorttext.classifier and LIBLINEAR. Note that some utilities of learner are implemented in C for efficiency.

Note

If the data set is in text format, use libshorttext.classifier rather than learner.

learner has three utility functions and one model class. If users want to replace the learner module with their own implementation, they need to implement the three utility functions and LearnerModel, which are used by libshorttext.classifier and libshorttext.analyzer.

Utility Functions of learner

libshorttext.classifier.learner.train(data_file_name, learner_opts='', liblinear_opts='')

Return a LearnerModel.

data_file_name is the file path of the LIBSVM-format data. learner_opts is a str. Refer to Parameters of the LearnerModule. liblinear_opts is a str of LIBLINEAR’s parameters. Refer to LIBLINEAR’s document.

libshorttext.classifier.learner.predict(data_file_name, m, liblinear_opts='')

Return a quadruple: the predicted labels, the accuracy, the decision values, and the true labels in the test data file (obtained through the LearnerModel m).

The predicted labels and the true labels in the file are lists. The accuracy is evaluated by assuming that the labels in the file are the true labels.

The decision values are in a list, where the length is the same as the number of test instances. Each element in the list is a c_double array, and the values in the array are an instance’s decision values in different classes. For example, the decision value of instance i and class k can be obtained by

>>> predicted_label, accuracy, all_dec_values, label = predict('svm_file', model)
>>> print(all_dec_values[i][k])
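
The shape of the returned decision values can be mimicked with the standard ctypes module (the values below are made up for illustration; the real arrays come from LIBLINEAR):

```python
from ctypes import c_double

# two test instances, three classes each
all_dec_values = [
    (c_double * 3)(0.8, -0.1, 0.2),
    (c_double * 3)(-0.4, 0.9, 0.0),
]

# decision value of instance i = 1 and class k = 1
print(all_dec_values[1][1])     # 0.9
# each c_double array supports len() and list() conversion
print(list(all_dec_values[0]))  # [0.8, -0.1, 0.2]
```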
libshorttext.classifier.learner.predict_one(xi, m)

Return the label and a c_double array of decision values of the test instance xi using LearnerModel m.

xi can be a list or a dict as in LIBLINEAR python interface. It can also be a LIBLINEAR feature_node array.

Note

This function is designed to analyze the result of one instance. It has a severe efficiency issue and should be used only by libshorttext.classifier.predict_single_text(). If many instances need to be predicted, they should be stored in a file and predicted by predict().

Warning

The content of xi may be changed after the function call.

The Middle-level Model — LearnerModel

class libshorttext.classifier.learner.LearnerModel(c_model, param=None, idf=None)

Bases: liblinear.model

LearnerModel is a middle-level classification model. It inherits from liblinear.model and adds two more members: a LearnerParameter instance and an inverse-document-frequency list.

We do not recommend that users create a LearnerModel by themselves. Instead, users should create and manipulate a LearnerModel via train(), predict(), and predict_one().

If users want to redefine LearnerModel, they must implement the following four methods, which are used by libshorttext.classifier and libshorttext.analyzer.

get_labels()

Return the labels of this model.

get_weight(j, k)

Return the weight of feature j and label k.

load(model_dir)

Load the contents from a LearnerModel directory.

save(model_dir, force=False)

Save the model to a directory. If force is set to True, the existing directory will be overwritten; otherwise, IOError will be raised.

Parameters of the LearnerModule

The parameter of LearnerModel is wrapped in a structure class — LearnerParameter. In the following, we will introduce LearnerParameter and LibShortText’s parameters for LIBSVM-format data.

class libshorttext.classifier.learner.LearnerParameter(learner_opts='', liblinear_opts='')

Bases: liblinear.parameter

LearnerParameter is the parameter structure used by LearnerModel. It consists of normalization parameters and LIBLINEAR parameters.

Both liblinear_opts and learner_opts are str or a list of str. For example, you can write either

>>> param = LearnerParameter('-N 1 -T 1', '-c 2 -e 1e-2')

or

>>> param = LearnerParameter(['-N', '1', '-T', '1'], ['-c', '2', '-e', '1e-2'])

liblinear_opts contains LIBLINEAR's parameters. Refer to LIBLINEAR's document for more details. learner_opts includes options for feature representation and instance-wise normalization. The preprocessor of LibShortText converts text files to LIBSVM-format data, where the features are word counts. All values in the options should be either 1 or 0, where 1 enables the option.

option      explanation when value is 1
-D value    Binary representation: all non-zero values are treated as 1. Enabled by default.
-T value    Term frequency: each instance is divided by its feature sum, i.e., \(x_i \leftarrow x_i / \sum_j |x_j|\), where \(x\) is the training instance and \(x_i\) is its \(i\)-th feature. Disabled by default.
-I value    Inverse document frequency (idf). Disabled by default.
-N value    Instance normalization: training instances are normalized to unit vectors before training. Enabled by default.

Note that if more than one option is enabled, they are applied in the order: binary representation, term frequency, idf, and instance normalization. The following example gives the tf-idf representation without instance normalization.

>>> param = LearnerParameter('-D 0 -T 1 -I 1 -N 0', liblinear_opts)
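
The order of operations can be sketched in plain Python; transform is a hypothetical helper written for illustration, not the library's C implementation, and the idf values below are made up.

```python
import math

def transform(x, binary=False, tf=False, idf_vals=None, normalize=False):
    # Apply the options in the documented order: -D (binary),
    # -T (term frequency), -I (idf), then -N (instance normalization).
    x = list(x)
    if binary:
        x = [1.0 if v != 0 else 0.0 for v in x]
    if tf:
        s = sum(abs(v) for v in x)
        if s:
            x = [v / s for v in x]
    if idf_vals:
        x = [v * w for v, w in zip(x, idf_vals)]
    if normalize:
        norm = math.sqrt(sum(v * v for v in x))
        if norm:
            x = [v / norm for v in x]
    return x

counts = [2.0, 0.0, 1.0, 1.0]  # word counts of one instance
idf = [1.0, 2.0, 1.5, 1.0]     # made-up idf values
# tf-idf without instance normalization ('-D 0 -T 1 -I 1 -N 0')
print(transform(counts, tf=True, idf_vals=idf))  # [0.5, 0.0, 0.375, 0.25]
```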
parse_options(learner_opts, liblinear_opts)

Set the options to the specified values.

set_to_default_values()

Set the options to the default values ('-D 1 -T 0 -I 0 -N 1').