********************************************
Data Preprocessing Tool --- :mod:`converter`
********************************************

.. automodule:: libshorttext.converter
   :members:
   :undoc-members:
   :inherited-members:
   :show-inheritance:

Utility Function
================

.. autofunction:: libshorttext.converter.convert_text

.. _Text2svmConverter:

Converter Class --- :class:`Text2svmConverter`
==============================================

.. autoclass:: libshorttext.converter.Text2svmConverter
   :members:
   :undoc-members:
   :inherited-members:
   :show-inheritance:

Other Components of :mod:`converter`
====================================

The following description introduces other components employed by
:class:`Text2svmConverter`. They are for advanced use. At the end of this
section, we give an example of customizing the preprocessor. (Refer to
:ref:`CustomizedPreprocessing`.)

:class:`TextPreprocessor`
~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`TextPreprocessor` is used for preprocessing, which includes the
following three steps, applied in this order:

1. Tokenization

2. Stemming

3. Stop-word removal

.. autoclass:: libshorttext.converter.TextPreprocessor
   :members:
   :undoc-members:
   :inherited-members:
   :show-inheritance:

:class:`FeatureGenerator`
~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: libshorttext.converter.FeatureGenerator
   :members:
   :undoc-members:
   :inherited-members:
   :show-inheritance:

:class:`ClassMapping`
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: libshorttext.converter.ClassMapping
   :members:
   :undoc-members:
   :inherited-members:
   :show-inheritance:

.. _CustomizedPreprocessing:

Customized Preprocessing
------------------------

Users can easily customize :class:`Text2svmConverter`. The following example
replaces the default tokenizer.


>>> from libshorttext.converter import *
>>>
>>> def comma_tokenizer(text):
...     return text.lower().split(',')
...
>>> text_converter = Text2svmConverter()
>>> text_converter.text_prep.tokenizer = comma_tokenizer

The default tokenizer splits a text into tokens on space characters
(i.e., ``' '``). In the above example, the text is instead split into tokens
on commas (``','``). This is useful if the input text is in CSV
(comma-separated values) form.

The three attributes of :class:`TextPreprocessor` ---
:attr:`TextPreprocessor.stopword_remover`, :attr:`TextPreprocessor.stemmer`,
and :attr:`TextPreprocessor.tokenizer` --- can be replaced in a similar way.
The following table shows the protocol of these functions. Note that they are
listed in the order in which they are called.

========================================= ================== ===================
Functions                                 Inputs             Outputs
========================================= ================== ===================
:attr:`TextPreprocessor.tokenizer`        A :class:`str` of  A :class:`list` of
                                          text.              tokens.
:attr:`TextPreprocessor.stemmer`          A :class:`list` of A :class:`list` of
                                          tokens.            stemmed tokens.
:attr:`TextPreprocessor.stopword_remover` A :class:`list`    A :class:`list` of
                                          of tokens (stemmed tokens, which is
                                          if needed).        a subset of input
                                                             with stopwords
                                                             being removed.
========================================= ================== ===================

If both stemming and stop-word removal are applied, LibShortText stems words
first and then removes stop words. Therefore, a replacement for
:attr:`TextPreprocessor.stopword_remover` receives tokens that may already be
stemmed, so its stop-word list may have to be stemmed as well.

.. note::

   :meth:`TextPreprocessor.save` and :meth:`TextPreprocessor.load` do **not**
   save and load the function variables
   (:attr:`TextPreprocessor.stopword_remover`,
   :attr:`TextPreprocessor.stemmer`, and :attr:`TextPreprocessor.tokenizer`).
   If they were modified, they need to be set again after loading.
   ::

      >>> from libshorttext.converter import *
      >>>
      >>> def comma_tokenizer(text):
      ...     return text.lower().split(',')
      ...
      >>> text_converter = Text2svmConverter()
      >>>
      >>> # set the tokenizer and save the converter
      >>> text_converter.text_prep.tokenizer = comma_tokenizer
      >>> text_converter.save('converter_path')
      >>>
      >>> new_text_converter = Text2svmConverter()
      >>>
      >>> # load the converter and set the tokenizer again
      >>> new_text_converter.load('converter_path')
      >>> new_text_converter.text_prep.tokenizer = comma_tokenizer
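
The protocol described in this section can be illustrated with a minimal
sketch that replaces all three functions. Every function name, the toy
stemming rule, and the stop-word list below are hypothetical and only serve
as an illustration; they are not part of LibShortText. The sketch assumes
that each attribute is called with exactly the inputs listed in the protocol
table, and that ``text_prep`` exposes :attr:`TextPreprocessor.stemmer` and
:attr:`TextPreprocessor.stopword_remover` in the same way as
:attr:`TextPreprocessor.tokenizer`.

```python
# Toy replacements following the documented protocol:
#   tokenizer:        str  -> list of tokens
#   stemmer:          list -> list of stemmed tokens
#   stopword_remover: list -> subset of the input list
# All names here are hypothetical and chosen for illustration only.

# Illustrative stop-word list. The entries are written in stemmed form
# because the remover runs after the stemmer.
STOP_WORDS = {"the", "a", "of"}


def comma_tokenizer(text):
    """Split lower-cased text on commas and drop empty tokens."""
    return [tok.strip() for tok in text.lower().split(",") if tok.strip()]


def plural_stemmer(tokens):
    """Very rough stemmer: strip one trailing 's' from longer tokens."""
    return [tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok
            for tok in tokens]


def simple_stopword_remover(tokens):
    """Return the (already stemmed) tokens that are not stop words."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def preprocess(text):
    """Chain the steps in the documented order, without LibShortText."""
    return simple_stopword_remover(plural_stemmer(comma_tokenizer(text)))


def install(text_converter):
    """Attach the three functions to a Text2svmConverter-like object."""
    text_converter.text_prep.tokenizer = comma_tokenizer
    text_converter.text_prep.stemmer = plural_stemmer
    text_converter.text_prep.stopword_remover = simple_stopword_remover
```

With these definitions, ``preprocess('The,Apples, of ,cats')`` yields
``['apple', 'cat']``. Calling ``install(text_converter)`` on a converter would
apply the same functions inside LibShortText; as noted above, it has to be
called again after loading a saved converter.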