Analysis Tool — analyzer

analyzer supports both micro analysis (of a single text instance) and macro analysis (e.g., accuracy over a group of instances). Use InstanceSet to specify which instances Analyzer should analyze.

>>> from libshorttext.analyzer import *
>>> 
>>> # load instances from an analyzable predict result file
>>> insts = InstanceSet('prediction_result_path')
>>> # select instances whose true and predicted labels are both among those specified
>>> insts = insts.select(with_labels(['Books', 'Music', 'Art']))
>>> 
>>> # create an analyzer
>>> analyzer = Analyzer('model_path')
>>> analyzer.gen_confusion_table(insts)
         Books  Music  Art
Books      169      1    0
Music        2    214    0
Art          6      0  162

The analysis tools require an analyzable prediction result and a model. Refer to libshorttext.classifier.PredictionResult and libshorttext.classifier.TextModel.

Analyzer and Other Auxiliary Classes

class libshorttext.analyzer.Analyzer(model=None)

Analyzer is a tool for analyzing a group of instances; the group is specified by an InstanceSet. Typically, Analyzer is initialized with a path to a model.

>>> from libshorttext.analyzer import *
>>> analyzer = Analyzer('model_path')

It can also be initialized with a libshorttext.classifier.TextModel instance.

>>> from libshorttext.analyzer import *
>>> from libshorttext.classifier import *
>>> text_model = TextModel('model_path')
>>> analyzer = Analyzer(text_model)

You can also construct an analyzer without a model. However, model-dependent functions cannot be used.

>>> from libshorttext.analyzer import *
>>> analyzer = Analyzer()

analyze_single(target, amount=5, output=None, extra_svm_feats=[])

analyze_single() analyzes a single instance. It prints the weights of all features of the instance for the top classes (5 by default), with the classes sorted by decision value in descending order. target can be a TextInstance or a string that you want to analyze. amount is the number of classes to print. If output is a path to a file, the result is written to that file instead of being printed on the screen.

>>> from libshorttext.analyzer import *
>>> analyzer = Analyzer('model_path')
>>> insts = InstanceSet('prediction_result_path')
>>> insts.load_text()
>>> analyzer.analyze_single(insts[61], 3)
                    Jewelry & Watches  Cameras & Photo  Coins & Paper Money
pb                          7.589e-19        2.041e-01            0.000e+00
green                      -8.897e-02        1.227e-02           -1.507e-01
mm                          5.922e-01        6.731e-01            1.256e-03
onyx silver                 1.382e-01       -6.198e-02           -4.743e-19
48                         -1.792e-02        2.188e-02           -1.346e-04
pendant                     1.107e+00       -1.039e-01           -1.409e-01
silver pendant              2.455e-01       -7.826e-02           -8.379e-02
silver                      8.533e-01       -2.205e-02            8.076e-01
onyx                        1.520e-01       -6.198e-02           -4.743e-19
**decval**                  9.937e-01        1.944e-01            1.444e-01
>>> analyzer.analyze_single('MICKEY MOUSE POT STAKE', 3)
                Home & Garden  Video Games & Consoles  Computers/Tablets & Networking
mickey              9.477e-02              -3.168e-02                       6.722e-02
mouse               2.119e-01               2.039e-01                      -2.212e-02
pot                 8.897e-01              -5.167e-02                      -2.466e-02
stake               4.057e-01              -2.147e-02                      -3.699e-02
mickey mouse        1.146e-01              -3.168e-02                       6.784e-02
mouse pot           4.041e-01              -2.147e-02                      -1.588e-02
pot stake           5.363e-01              -2.147e-02                      -1.588e-02
**decval**          1.004e+00               9.255e-03                       7.385e-03

If target is a str and extra svm files are used in training, the same number of extra svm features can be specified in extra_svm_feats. Extra svm features should be a list of dictionaries. If target is a TextInstance, the extra features in the TextInstance will be used.
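As an aid to reading these tables, the numbers in the 'MICKEY MOUSE POT STAKE' example are consistent with each decision value being the sum of the column's feature weights divided by the square root of the number of features. That is what a linear model would give for binary features scaled to unit length, but this reading is an inference from the printed numbers, not something the library documentation states. The sketch below only redoes the arithmetic; `estimated_decval` is a hypothetical helper.

```python
import math

# Feature weights copied from the 'MICKEY MOUSE POT STAKE' table,
# one list per class column (rows: mickey ... pot stake).
weights = {
    "Home & Garden": [9.477e-02, 2.119e-01, 8.897e-01, 4.057e-01,
                      1.146e-01, 4.041e-01, 5.363e-01],
    "Video Games & Consoles": [-3.168e-02, 2.039e-01, -5.167e-02, -2.147e-02,
                               -3.168e-02, -2.147e-02, -2.147e-02],
    "Computers/Tablets & Networking": [6.722e-02, -2.212e-02, -2.466e-02,
                                       -3.699e-02, 6.784e-02, -1.588e-02,
                                       -1.588e-02],
}

# Values printed in the **decval** row of the same table.
printed_decvals = {
    "Home & Garden": 1.004e+00,
    "Video Games & Consoles": 9.255e-03,
    "Computers/Tablets & Networking": 7.385e-03,
}


def estimated_decval(column):
    # Assumption: every feature is binary and the vector is scaled to
    # unit length, so each feature value is 1 / sqrt(number of features).
    return sum(column) / math.sqrt(len(column))


for label, column in weights.items():
    print("%-32s estimated %.3e  printed %.3e"
          % (label, estimated_decval(column), printed_decvals[label]))
```

The small differences between the estimated and printed values are within what rounding the weights to four significant digits would explain.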

gen_confusion_table(pred_insts, output=None)

gen_confusion_table() generates a confusion table for a group of predicted instances, pred_insts. If output is a path to a file, the result is written to that file instead of being printed on the screen.

>>> from libshorttext.analyzer import *
>>> analyzer = Analyzer('model_path')
>>> insts = InstanceSet('prediction_result_path')
>>> insts = insts.select(with_labels(['Books', 'Music', 'Art']))
>>> analyzer.gen_confusion_table(insts)
         Books  Music  Art
Books      169      1    0
Music        2    214    0
Art          6      0  162
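To make clear what such a table counts, here is a minimal sketch that tallies (true label, predicted label) pairs into the same kind of grid. It does not use libshorttext: `confusion_counts` and the toy label pairs are hypothetical, and treating rows as true labels and columns as predicted labels is an assumption rather than something stated above.

```python
from collections import Counter


def confusion_counts(pairs, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    counts = Counter(pairs)
    return [[counts[(true_y, pred_y)] for pred_y in labels] for true_y in labels]


labels = ["Books", "Music", "Art"]
# Toy data for illustration only: (true label, predicted label).
pairs = [
    ("Books", "Books"),
    ("Books", "Music"),
    ("Music", "Music"),
    ("Art", "Books"),
    ("Art", "Art"),
]

table = confusion_counts(pairs, labels)
print(" " * 6 + "".join("%7s" % label for label in labels))
for label, row in zip(labels, table):
    print("%-6s" % label + "".join("%7d" % count for count in row))
```

Correct predictions land on the diagonal, so off-diagonal cells show which pairs of classes are confused with each other.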

info(pred_insts, output=None)

info() prints information about a group of instances (an InstanceSet object). pred_insts is the group of target instances. If output is a path to a file, the result is written to that file instead of being printed on the screen.

>>> from libshorttext.analyzer import *
>>> analyzer = Analyzer('model_path')
>>> insts = InstanceSet('prediction_result_path')
>>> insts = insts.select(with_labels(['Books', 'Music', 'Art']))
>>> analyzer.info(insts)
Number of instances: 554
Accuracy: 0.983754512635 (545/554)
True labels: "Art"  "Books"  "Music"
Predict labels: "Art"  "Books"  "Music"
Text source:
/home/guestwalk/working/short_text/svn/software-dev/test_file
Selectors:
-> labels: "Books", "Music", "Art"
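The accuracy line can be cross-checked against the confusion table that gen_confusion_table() prints for the same Books/Music/Art selection, since the diagonal of that table holds the correctly predicted instances. The sketch below only redoes that arithmetic with counts copied from the example; the variable names are illustrative.

```python
# Counts copied from the Books/Music/Art confusion table in the
# gen_confusion_table() example.
table = [
    [169, 1, 0],
    [2, 214, 0],
    [6, 0, 162],
]

# Correct predictions are on the diagonal; the total is every cell.
correct = sum(table[i][i] for i in range(len(table)))
total = sum(sum(row) for row in table)
accuracy = correct / total

print("Number of instances: %d" % total)
print("Accuracy: %.12f (%d/%d)" % (accuracy, correct, total))
```

The result agrees with the 545/554 reported by info() for this selection.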

load_model(model)

load_model() is used to load a model into Analyzer. If you did not load a model in the constructor or if you would like to use another model, you can use this function.

There are two ways to load a model: from an instance of libshorttext.classifier.TextModel or from a path to a model.

>>> from libshorttext.analyzer import *
>>> analyzer = Analyzer('original_model_path')
>>> analyzer.load_model('new_model_path')

Auxiliary Classes

Analyzer has two auxiliary classes: InstanceSet and TextInstance. InstanceSet is a set of instances. It is used to indicate which instances Analyzer should consider. Each instance in InstanceSet is a TextInstance instance.

class libshorttext.analyzer.InstanceSet(rst_src=None, text_src=None)

InstanceSet is a group of TextInstance instances. It is used to select the subset of data you are interested in. It should be initialized with a prediction result file and, optionally, the testing data. By default, the path to the testing data is stored in the prediction result file, so you only need to give the path to the prediction result file.

>>> from libshorttext.analyzer import *
>>> insts = InstanceSet('prediction_result_path')

If you have moved the testing data, you must supply its new path.

>>> from libshorttext.analyzer import *
>>> insts = InstanceSet('prediction_result_path', 'testing_data_path')

load_text()

The texts of instances are not stored in the prediction result file, so you need to call this function to load them from the testing data.

>>> from libshorttext.analyzer import *
>>> insts = InstanceSet('prediction_result_path')
>>> insts.load_text()

This method also loads the extra svm features if extra svm files were used in training.

select(*sel_funcs)

This function helps users find the data they are interested in. Its arguments are selector functions, each of which takes a list of instances and returns a list of instances. There are several built-in selector functions. Refer to Built-in Instance Selector Functions.

>>> from libshorttext.analyzer import *
>>> insts = InstanceSet('prediction_result_path')
>>> insts1 = insts.select(wrong, with_labels(['Books', 'Music'])) 
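One way to picture select() is as function composition over lists, where each selector receives the list produced by the previous one. The sketch below is a stand-in, not libshorttext code: `Inst` and `apply_selectors` are hypothetical, and applying the selectors from left to right is an assumption suggested by examples such as select(sort_by_dec, reverse).

```python
from collections import namedtuple

# Hypothetical stand-in for TextInstance with only the fields used here.
Inst = namedtuple("Inst", ["idx", "true_y", "predicted_y"])


def wrong(insts):
    """Keep instances whose predicted label differs from the true label."""
    return [inst for inst in insts if inst.true_y != inst.predicted_y]


def reverse(insts):
    """Return the instances in reverse order."""
    return list(reversed(insts))


def apply_selectors(insts, *sel_funcs):
    """Feed the output of each selector into the next one (assumed order)."""
    for sel_func in sel_funcs:
        insts = sel_func(insts)
    return insts


insts = [
    Inst(0, "Books", "Books"),
    Inst(1, "Books", "Music"),
    Inst(2, "Music", "Art"),
]
selected = apply_selectors(insts, wrong, reverse)
print([inst.idx for inst in selected])
```

Under this reading, the order of arguments matters for selectors that sort or truncate, which is why a sort is typically placed before a subset.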

class libshorttext.analyzer.TextInstance(idx, true_y='', predicted_y='', text='', extra_svm_feats=[], decvals=None)

TextInstance represents a text instance. It includes the index, the true label, the predicted label, the text, and the decision values of the text instance. Normally you do not directly create an instance. Instead, it is usually manipulated by InstanceSet. For more information, please see the usage in InstanceSet.

decvals = None

A list of decision values. The length should be the number of classes.

extra_svm_feats = None

The extra svm features. The value is empty at the beginning and is filled after InstanceSet.load_text() is called.

idx = None

Instance index in the text source.

predicted_y = None

The predicted label.

text = None

The original text. The value is an empty str ('') at the beginning and is filled after InstanceSet.load_text() is called.

true_y = None

The true label (if provided in the text source in the prediction phase).

Built-in Instance Selector Functions

Five selector functions are defined in analyzer. They should be used with InstanceSet.select().

libshorttext.analyzer.reverse(insts)

Reverse the order of instances.

This function should be passed to InstanceSet.select() without any argument.

>>> insts = InstanceSet('prediction_result_path').select(reverse)

libshorttext.analyzer.sort_by_dec(insts)

Sort instances by the decision values of the predicted labels in ascending order. You can combine this function with reverse() to sort decision values from large to small.

>>> insts = InstanceSet('prediction_result_path').select(sort_by_dec, reverse)

This function should be passed to InstanceSet.select() without any argument.

libshorttext.analyzer.subset(amount, method='top')

Find a subset of the InstanceSet. amount is the number of instances to select. method can be 'top' or 'random'. If method is 'top', the first amount instances are selected. Otherwise, subset() selects instances randomly. If amount is larger than the number of instances, subset() returns all instances.

The 'top' method is useful when used after sort_by_dec(). The following example selects ten instances with the smallest decision values of the predicted label.

>>> insts = InstanceSet('prediction_result_path').select(sort_by_dec, subset(10))
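The behaviour described for subset() can be sketched as a function that returns a selector. This is an illustrative stand-in written from the description above, not the library's code: `make_subset` is a hypothetical name, and using random.sample for the 'random' method is one reasonable reading of "selects instances randomly".

```python
import random


def make_subset(amount, method="top"):
    """Return a selector that keeps at most `amount` items of a list."""
    def selector(insts):
        if amount >= len(insts):
            # Asking for more than is available returns everything.
            return list(insts)
        if method == "top":
            return list(insts[:amount])
        # Any other method is treated as random selection without replacement.
        return random.sample(list(insts), amount)
    return selector


# Plain numbers stand in for instances already sorted in ascending order.
sorted_items = [0.1, 0.4, 0.7, 0.9, 1.2]

print(make_subset(2)(sorted_items))
print(len(make_subset(3, "random")(sorted_items)))
print(make_subset(10)(sorted_items))
```

Because 'top' simply keeps the head of the list, its meaning depends entirely on how the list was ordered before it was applied.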

libshorttext.analyzer.with_labels(labels, target='both')

Select instances with specified labels. labels is an iterable object of str instances, which represent the label names.

target can be 'true', 'predict', 'both', 'or'. If target is 'true', then this function finds instances based on the true label specified in the test data. If target is 'predict', it finds instances based on the predicted labels. 'both' and 'or' find the intersection and the union of 'true' and 'predict', respectively. The default value of 'target' is 'both'.

The following example selects instances where both the true label and the predicted label are 'Music' or 'Books' (target defaults to 'both').

>>> insts = InstanceSet('prediction_result_path').select(with_labels(['Books', 'Music']))
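The four target values can be compared side by side with a small stand-in. This sketch follows the description above but is not the library's implementation; `Inst` and `make_with_labels` are hypothetical names and the instances are toy data.

```python
from collections import namedtuple

# Hypothetical stand-in for TextInstance with only the two label fields.
Inst = namedtuple("Inst", ["true_y", "predicted_y"])


def make_with_labels(labels, target="both"):
    """Return a selector that filters instances by true/predicted label."""
    wanted = set(labels)

    def keep(inst):
        in_true = inst.true_y in wanted
        in_pred = inst.predicted_y in wanted
        if target == "true":
            return in_true
        if target == "predict":
            return in_pred
        if target == "both":
            return in_true and in_pred   # intersection
        if target == "or":
            return in_true or in_pred    # union
        raise ValueError("unknown target: %r" % (target,))

    def selector(insts):
        return [inst for inst in insts if keep(inst)]

    return selector


insts = [
    Inst("Books", "Books"),
    Inst("Books", "Art"),
    Inst("Art", "Music"),
    Inst("Art", "Art"),
]
for target in ("true", "predict", "both", "or"):
    selected = make_with_labels(["Books", "Music"], target)(insts)
    print(target, len(selected))
```

With 'both', an instance whose true label is 'Books' but which was predicted as 'Art' is dropped; use 'true' or 'or' to keep such misclassifications.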

libshorttext.analyzer.wrong(insts)

Select wrongly predicted instances. It assumes that the labels in the test data are true labels.

This function should be passed to InstanceSet.select() without any argument.

>>> insts = InstanceSet('prediction_result_path').select(wrong)

User-defined Selector Functions

Users can define a selector function by themselves. A selector function’s input is a list of TextInstance, and the return value should be another list of TextInstance. For example, wrong() is equivalent to the following example:

def wrong(insts):
    wrong_label = lambda inst: inst.true_y != inst.predicted_y
    wrong_insts = filter(wrong_label, insts)
    return list(wrong_insts)

Refer to TextInstance for the instance members.
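As a further illustration of the list-in, list-out contract, the following sketch defines a selector that keeps instances whose two largest decision values are close together. It is a hypothetical example: `low_margin`, `MARGIN`, and the stand-in `Inst` type are not part of libshorttext, and the code assumes every instance has at least two decision values.

```python
from collections import namedtuple

# Hypothetical stand-in for TextInstance with only the fields used here.
Inst = namedtuple("Inst", ["idx", "decvals"])

MARGIN = 0.1  # illustrative threshold, chosen arbitrarily


def low_margin(insts):
    """Keep instances whose top two decision values differ by less than MARGIN."""
    def margin(inst):
        best, second = sorted(inst.decvals, reverse=True)[:2]
        return best - second
    return [inst for inst in insts if margin(inst) < MARGIN]


insts = [
    Inst(0, [0.90, 0.10, 0.00]),
    Inst(1, [0.52, 0.49, 0.10]),
    Inst(2, [0.30, 0.25, 0.20]),
]
print([inst.idx for inst in low_margin(insts)])
```

A function shaped like this takes a list and returns a list, so under the contract described above it could be passed to InstanceSet.select() like the built-in selectors.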

analyzer provides a function selectorize() to convert some simple rules to a selector function. The example above can be further simplified by selectorize().

libshorttext.analyzer.selectorize(option='general', comment=None)

A function decorator which returns a function wrapper to generate a selector function.

option can be 'select', 'sort', or 'general'. See the following table.

option      What should the defined function do?
'select'    Decide whether an instance should be selected. The input is a TextInstance, and the output should be True or False; True means that the instance is selected.
'sort'      Return the sorting key of a TextInstance. The input is a TextInstance, and the output should be a comparable value or object.
'general'   Behave as a complete selector function, as if no wrapper were applied. Both the input and the output are lists of TextInstance.

For example, wrong() is equivalent to the following function:

@selectorize('select', 'Select wrongly predicted instances')
def wrong(inst):
    return inst.true_y != inst.predicted_y

And, sort_by_dec() is equivalent to the following function:

@selectorize('sort', 'Sort by maximum decision values.')
def sort_by_dec(inst):
    return max(inst.decvals)
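To show how the three options relate to ordinary selector functions, here is a minimal sketch of a decorator with the same shape. It is not the libshorttext implementation: `selectorize_sketch`, the `comment` attribute on the wrapper, and the stand-in `Inst` type are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for TextInstance with only the fields used here.
Inst = namedtuple("Inst", ["true_y", "predicted_y", "decvals"])


def selectorize_sketch(option="general", comment=None):
    """Illustrative decorator factory that turns a rule into a selector."""
    def decorator(func):
        if option == "select":
            def selector(insts):
                # func answers True/False for one instance.
                return [inst for inst in insts if func(inst)]
        elif option == "sort":
            def selector(insts):
                # func returns the sort key of one instance (ascending order).
                return sorted(insts, key=func)
        elif option == "general":
            def selector(insts):
                # func is already a list-in, list-out selector.
                return func(insts)
        else:
            raise ValueError("unknown option: %r" % (option,))
        # Assumption: keep the comment on the wrapper so a report could show it.
        selector.comment = comment
        return selector
    return decorator


@selectorize_sketch("select", "Select wrongly predicted instances")
def wrong(inst):
    return inst.true_y != inst.predicted_y


@selectorize_sketch("sort", "Sort by maximum decision values.")
def sort_by_dec(inst):
    return max(inst.decvals)


insts = [
    Inst("Books", "Books", [0.9, 0.1]),
    Inst("Books", "Music", [0.2, 0.6]),
    Inst("Music", "Books", [0.3, 0.1]),
]
result = sort_by_dec(wrong(insts))
print([inst.decvals for inst in result])
print(wrong.comment)
```

The point of the design is that a per-instance rule stays one line long, while the wrapper supplies the list handling that every selector needs.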

comment is a description of the function, which is shown by libshorttext.analyzer.Analyzer.info(). See the following example.

>>> from libshorttext.analyzer import *
>>> 
>>> @selectorize(comment='foo function')
... def foo(x):
...     return x
... 
>>> insts = InstanceSet('predict_result_path').select(foo)
>>> Analyzer('model_path').info(insts)
[output skipped]
Selectors :
-> foo function