Data Preprocessing Tool — converter

The converter module converts a text data set to a numerical data set. More specifically, it converts a text file to LIBSVM-format data. Refer to Installation and Data Format for the format of text data.

The utilities of converter are wrapped in Text2svmConverter, which consists of three components: TextPreprocessor, FeatureGenerator, and ClassMapping. Users who only need the most basic usage can call the utility function convert_text() without understanding converter.

Utility Function

libshorttext.converter.convert_text(text_src, converter, output='')

Convert text data to LIBSVM-format data.

text_src is the path of the text data or a file object. (Refer to Installation and Data Format.) output is the output of the converted LIBSVM-format data; it can also be a file path or a file object. Note that if text_src or output is a file object, it will be closed. converter is a Text2svmConverter instance.
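Conceptually, the conversion maps each "label<TAB>text" line to one LIBSVM line of "class_index feature_index:count" pairs. The following self-contained sketch illustrates the idea only; it is not the library implementation (use convert_text() in practice), and the helper name toy_convert is made up for this example.

```python
def toy_convert(lines):
    """Illustration only: mimic the text-to-LIBSVM conversion that
    convert_text() performs, using toy in-memory mappings."""
    class_map, token_map, out = {}, {}, []
    for line in lines:
        label, text = line.split('\t', 1)
        # map the class label to an index (the ClassMapping role)
        y = class_map.setdefault(label, len(class_map))
        counts = {}
        # tokenize and count token occurrences (the TextPreprocessor
        # and FeatureGenerator roles, here collapsed for brevity)
        for tok in text.split():
            idx = token_map.setdefault(tok, len(token_map) + 1)
            counts[idx] = counts.get(idx, 0) + 1
        out.append('%d %s' % (y, ' '.join(
            '%d:%d' % (i, counts[i]) for i in sorted(counts))))
    return out

print(toy_convert(['sports\tgreat game', 'music\tgreat song']))
# ['0 1:1 2:1', '1 1:1 3:1']
```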

Converter Class — Text2svmConverter

class libshorttext.converter.Text2svmConverter(option='', readonly=False)

Bases: object

Text2svmConverter converts text data to LIBSVM-format data. (Refer to Installation and Data Format for the text data format.) It consists of three components: TextPreprocessor, FeatureGenerator, and ClassMapping.

The option can be any option of TextPreprocessor, FeatureGenerator and ClassMapping.

Note

Redundant options are ignored quietly. Users should pay attention to the spelling of the options.

Text2svmConverter is read-only if the readonly flag is set. If it is not read-only, the converter will be updated when new tokens or new class names are found.

class_map = None

The ClassMapping instance.

feat_gen = None

The FeatureGenerator instance.

getClassIdx(class_name)

Return the class index corresponding to the given class name.

getClassName(class_idx)

Return the class name corresponding to the given class index.

get_fidx2tok(fidx)

Return the token corresponding to the given feature index.

load(src_dir, readonly=True)

Load the model from a directory.

merge_svm_files(svm_file, extra_svm_files)

Append extra feature files to svm_file.

extra_svm_files is a list of extra feature files in LIBSVM format. These features will be appended to svm_file. All files in extra_svm_files and svm_file must have the same number of instances.

Note

The output file is svm_file. Therefore, the original svm_file will be overwritten without backup.

save(dest_dir)

Save the model to a directory.

text_prep = None

The TextPreprocessor instance.

toSVM(text, class_name=None, extra_svm_feats=[])

Return a LIBSVM Python interface instance for the given text. Note that feat_gen will be updated if the converter is not read-only and there are new tokens in the given text.

extra_svm_feats is a list of feature sets, each of which is a dict. Its length should be zero or equal to the number of extra svm files used. If the length is zero (i.e., an empty list), the features are returned as if there were no extra svm files.

Other Components of converter

The following description introduces the other components employed by Text2svmConverter. They are for advanced use. At the end of this section, we give an example of customizing the feature preprocessor. (Refer to Customized Preprocessing.)

TextPreprocessor

TextPreprocessor is used for preprocessing, which includes the following three steps:

  1. Tokenization
  2. Stemming
  3. Stop-word removal
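A minimal self-contained sketch of this pipeline, with a toy stemmer and a toy stopword list (both are illustrative assumptions, not the library's Porter stemmer or default stoplist). As noted under Customized Preprocessing, stemming is applied before stop-word removal:

```python
STOPWORDS = {'the', 'a', 'of'}

def toy_stem(token):
    # toy stemmer: strip a trailing plural 's' (illustration only)
    return token[:-1] if token.endswith('s') else token

def toy_preprocess(text):
    tokens = text.lower().split()                     # 1. tokenization
    tokens = [toy_stem(t) for t in tokens]            # 2. stemming
    return [t for t in tokens if t not in STOPWORDS]  # 3. stop-word removal

print(toy_preprocess('The cats of Athens'))  # ['cat', 'athen']
```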
class libshorttext.converter.TextPreprocessor(option='-stemming 0 -stopword 0', readonly=False)

Bases: object

TextPreprocessor is used to pre-process the raw texts to a list of feature indices. First, each text is tokenized by the tokenizer into a list of tokens. Tokens are then passed to the stemmer and the stopword_remover. Finally, each stemmed token is converted to a token index.

Refer to parse_option() for the option parameter.

If readonly is set to True, the feature index mapping will not be updated even if new tokens appear; such new tokens are simply ignored. readonly should be set to True for testing and False for training.

static default_stoplist()

Return a default stopword list provided by LibShortText.

Note that LibShortText stems words first (if a stemmer is provided). Therefore, all words on the stopword list should be stemmed first. The following example creates a stopword_remover from a list.

>>> from libshorttext.converter import *
>>> 
>>> preprocessor = TextPreprocessor('-stemming 1')
>>> stoplist = preprocessor.stemmer(list(TextPreprocessor.default_stoplist()))
>>> preprocessor.stopword_remover = lambda text: filter(
...     lambda token: token not in stoplist, text)
static default_tokenizer(text)

The default tokenizer provided by LibShortText.

The default tokenizer is used to tokenize English documents. It splits a text into tokens by whitespace characters and normalizes tokens using NFD (normalization form D).
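The described behavior can be sketched with the standard unicodedata module; this reproduces the whitespace split and NFD normalization, but is not the library's exact code:

```python
import unicodedata

def whitespace_nfd_tokenizer(text):
    # Split on whitespace and apply Unicode NFD normalization, which
    # decomposes accented characters (e.g. 'é' becomes 'e' plus a
    # combining accent code point).
    return [unicodedata.normalize('NFD', tok) for tok in text.split()]

tokens = whitespace_nfd_tokenizer('café au lait')
print(tokens)          # three tokens
print(len(tokens[0]))  # 5: 'café' has 5 code points under NFD
```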

get_idx2tok(idx)

Access the index-token mapping. Given a numerical idx, this function returns the corresponding token.

Note

Because the index-to-token mapping is not maintained internally, the first call to this function takes longer, as it builds the reverse mapping. This function should always be called on a readonly TextPreprocessor instance to avoid inconsistency between the token-to-index mapping and its reverse.
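Building the reverse mapping on first use amounts to a one-time dictionary inversion, sketched below (the function name and dict layout are illustrative, not the library's internals):

```python
def build_reverse_mapping(tok2idx):
    # Invert a token-to-index dict once; reuse the result for later lookups.
    return {idx: tok for tok, idx in tok2idx.items()}

idx2tok = build_reverse_mapping({'apple': 1, 'pie': 2})
print(idx2tok[2])  # pie
```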

load(src_file, readonly=True)

Load the TextPreprocessor instance from the src_file file, which is a pickle file generated by cPickle.

If readonly is True, the TextPreprocessor instance will not be modifiable.

parse_option(option)

Parse the given str parameter option and set stemmer and stopword_remover to the desired functions.

option is a str instance:

Option             Description
-stopword method   If method is 1, default_stoplist() is used. If method is 0, no word is removed. Default is 0 (no stop-word removal).
-stemming method   If method is 1, the Porter stemmer is used. If method is 0, tokens are not stemmed. Default is 0 (no stemming).

The following example creates a TextPreprocessor that applies Porter stemmer and removes stop words.

>>> preprocessor = TextPreprocessor()
>>> preprocessor.parse_option('-stopword 1 -stemming 1')

Note

Redundant options are ignored quietly. Users should pay attention to the spelling of the options.

preprocess(text)

Preprocess the given text into a list of token indices, where text is a str instance.

If the preprocessor is not in the read-only mode, preprocess() expands the internal token-index mapping for unseen tokens; otherwise, this function ignores unseen tokens.
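The read-only contract can be sketched as follows (an illustration of the described behavior, not the library code; the function name is made up):

```python
def to_indices(tokens, tok2idx, readonly):
    # Map tokens to indices. In read-only mode, unseen tokens are dropped;
    # otherwise the mapping grows to cover them.
    indices = []
    for tok in tokens:
        if tok not in tok2idx:
            if readonly:
                continue
            tok2idx[tok] = len(tok2idx)
        indices.append(tok2idx[tok])
    return indices

mapping = {}
print(to_indices(['a', 'b', 'a'], mapping, readonly=False))  # [0, 1, 0]
print(to_indices(['a', 'c'], mapping, readonly=True))        # [0]
```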

save(dest_file)

Save the TextPreprocessor to a file.

Note

Function variables are not saved by this method. Even if stopword_remover, stemmer, or tokenizer have been modified, the modifications will not be saved. Therefore, these functions must be set again after the instance is loaded. Refer to Customized Preprocessing.

stemmer = None

The function used to stem tokens.

Refer to Customized Preprocessing.

stopword_remover = None

The function used to remove stop words.

Refer to Customized Preprocessing.

tokenizer = None

The function used to tokenize texts into a list of tokens.

Refer to Customized Preprocessing.

FeatureGenerator

class libshorttext.converter.FeatureGenerator(option='-feature 1', readonly=False)

Bases: object

FeatureGenerator is used to generate uni-gram or bi-gram features.

bigram(text)

Generate a dict corresponding to the sparse vector of the bi-gram representation of the given text, which is a list of tokens.

feat_gen = None

feat_gen is a variable pointing to the function that conducts feature generation. It can be either unigram() or bigram(), determined by option.

get_fidx2ngram(fidx)

Access the index-to-ngram mapping. Given a numerical fidx, this function returns the corresponding ngram.

Note

Because the index-to-ngram mapping is not maintained internally, the first call to this function takes longer, as it builds the mapping. This function should always be called on a readonly FeatureGenerator instance to avoid inconsistency between the ngram-to-index mapping and its reverse.

load(src_file, readonly=True)

Load the FeatureGenerator instance from the src_file file, which is a pickle file generated by cPickle. We suggest using Python 2.7 or a newer version, which has a faster cPickle implementation.

If readonly is True, the FeatureGenerator instance will be readonly.

parse_option(option)

Parse the given str parameter option and set feat_gen to the desired function.

There is only one option in this version.

Option            Description
-feature method   If method is 1, bigram features are used. If method is 0, unigram features are used. Default is 1 (bigram).

For example, the following creates a unigram feature generator.

>>> feature_generator = FeatureGenerator()
>>> feature_generator.parse_option('-feature 0')

Note

Redundant options are ignored quietly. Users should pay attention to the spelling of the options.

save(dest_file)

Save the FeatureGenerator instance into the dest_file file, which will be a pickle file generated by cPickle. We suggest using Python 2.7 or a newer version, which has a faster cPickle implementation.

toSVM(text)

Generate a dict instance for the given text, which is a list of tokens. Each key of the returned dictionary is an index corresponding to an ngram feature, and the corresponding value is the number of occurrences of that feature.

If not in read-only mode, this function expands the internal ngram-index mapping for unseen ngrams; otherwise, unseen ngrams are ignored.

unigram(text)

Generate a dict corresponding to the sparse vector of the uni-gram representation of the given text, which is a list of tokens.
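The two representations can be sketched as sparse count dictionaries. For readability the keys here are the n-grams themselves, whereas the library maps each n-gram to a numerical feature index:

```python
def unigram_counts(tokens):
    # sparse vector of single-token counts
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def bigram_counts(tokens):
    # sparse vector of adjacent-token-pair counts
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

words = ['to', 'be', 'or', 'not', 'to', 'be']
print(unigram_counts(words))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(bigram_counts(words))   # {('to', 'be'): 2, ('be', 'or'): 1, ...}
```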

ClassMapping

class libshorttext.converter.ClassMapping(option='', readonly=False)

Bases: object

ClassMapping is used to handle the mapping between the class label and the internal class index.

option is ignored in this version.

load(src_file, readonly=True)

Load the ClassMapping instance from the src_file file, which is a pickle file generated by cPickle.

If readonly is True, the ClassMapping instance will be readonly.

rename(old_label, new_label)

Rename old_label to new_label. old_label can be either a str denoting the class label or an int denoting the class index. new_label should be a str different from all existing labels.

save(dest_file)

Save the ClassMapping instance into the dest_file file, which will be a pickle file generated by cPickle.

toClassName(idx)

Return the class label corresponding to the given class idx.

Note

This method will reconstruct the mapping if toIdx() has been called after the previous toClassName(). Users should avoid calling toClassName() and toIdx() alternately.

toIdx(class_name)

Return the internal class index for the given class_name.

If readonly is False, toIdx() generates a new index for an unseen class_name; otherwise, toIdx() returns None.
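The documented contract of toIdx() and toClassName() can be sketched with a toy class (illustrative only; the class and attribute names here are not the library's):

```python
class ToyClassMapping:
    """Sketch of the described label/index mapping behavior."""
    def __init__(self):
        self.name2idx = {}
        self.readonly = False

    def toIdx(self, class_name):
        if class_name in self.name2idx:
            return self.name2idx[class_name]
        if self.readonly:
            return None  # unseen labels get no index in read-only mode
        self.name2idx[class_name] = len(self.name2idx)
        return self.name2idx[class_name]

    def toClassName(self, idx):
        # rebuild the reverse mapping on demand, as the note describes
        return {v: k for k, v in self.name2idx.items()}[idx]

m = ToyClassMapping()
print(m.toIdx('sports'))     # 0
print(m.toIdx('music'))      # 1
print(m.toClassName(1))      # music
m.readonly = True
print(m.toIdx('travel'))     # None
```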

Customized Preprocessing

Users can easily customize the Text2svmConverter. See the following example to replace the default tokenizer.

>>> from libshorttext.converter import *
>>>
>>> def comma_tokenizer(text):
...     return text.lower().split(',')
>>>
>>> text_converter = Text2svmConverter()
>>> text_converter.text_prep.tokenizer = comma_tokenizer

The default tokenizer splits a text into tokens by space characters (i.e., ' '). In the above example, the text is instead split into tokens by commas (','). This is useful if the input text is in CSV (comma-separated values) form.

The three attributes of TextPreprocessor, namely TextPreprocessor.stopword_remover, TextPreprocessor.stemmer, and TextPreprocessor.tokenizer, can be replaced in a similar way. The following table shows the protocol of these functions. Note that they are listed in the same order as they are called.

Function                            Input                                  Output
TextPreprocessor.tokenizer          A str of text.                         A list of tokens.
TextPreprocessor.stemmer            A list of tokens.                      A list of stemmed tokens.
TextPreprocessor.stopword_remover   A list of tokens (stemmed if needed).  A list of tokens with stop words removed (a subset of the input).

If both stemming and stop-word removal are applied, LibShortText stems words first and then removes stop words. Therefore, if users replace TextPreprocessor.stopword_remover, they should note that its input tokens may already be stemmed.

Note

TextPreprocessor.save() and TextPreprocessor.load() do not save and load the function variables (TextPreprocessor.stopword_remover, TextPreprocessor.stemmer, and TextPreprocessor.tokenizer). They need to be set again after loading if they were modified.

>>> from libshorttext.converter import *
>>>
>>> def comma_tokenizer(text):
...     return text.lower().split(',')
>>>
>>> text_converter = Text2svmConverter()
>>>
>>> # set the tokenizer and save the converter
>>> text_converter.text_prep.tokenizer = comma_tokenizer
>>> text_converter.save('converter_path')
>>>
>>> new_text_converter = Text2svmConverter()
>>>
>>> # load the converter and set the tokenizer again
>>> new_text_converter.load('converter_path')
>>> new_text_converter.text_prep.tokenizer = comma_tokenizer