********************************************
Data Preprocessing Tool --- :mod:`converter`
********************************************

.. automodule:: libshorttext.converter
        :members:
        :undoc-members:
        :inherited-members:
        :show-inheritance:


Utility Function
================

.. autofunction:: libshorttext.converter.convert_text


.. _Text2svmConverter:

Converter Class --- :class:`Text2svmConverter`
==============================================
.. autoclass:: libshorttext.converter.Text2svmConverter
        :members:
        :undoc-members:
        :inherited-members:
        :show-inheritance:


Other Components of :mod:`converter`
====================================

The following sections describe the other components used by :class:`Text2svmConverter`.
They are intended for advanced use. The end of this section gives
an example of customizing the preprocessing steps (see
:ref:`CustomizedPreprocessing`).

:class:`TextPreprocessor`
~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`TextPreprocessor` preprocesses text in the following three steps,
applied in this order:

	1. Tokenization
	2. Stemming
	3. Stop-word removal


.. autoclass:: libshorttext.converter.TextPreprocessor
        :members:
        :undoc-members:
        :inherited-members:
        :show-inheritance:


:class:`FeatureGenerator`
~~~~~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: libshorttext.converter.FeatureGenerator
        :members:
        :undoc-members:
        :inherited-members:
        :show-inheritance:


:class:`ClassMapping`
~~~~~~~~~~~~~~~~~~~~~
.. autoclass:: libshorttext.converter.ClassMapping
        :members:
        :undoc-members:
        :inherited-members:
        :show-inheritance:


.. _CustomizedPreprocessing:

Customized Preprocessing
------------------------

Users can customize :class:`Text2svmConverter` by replacing its preprocessing
functions. The following example replaces the default tokenizer.

        >>> from libshorttext.converter import Text2svmConverter
        >>>
        >>> def comma_tokenizer(text):
        ...     # Lowercase the text and split it on commas.
        ...     return text.lower().split(',')
        ...
        >>> text_converter = Text2svmConverter()
        >>> text_converter.text_prep.tokenizer = comma_tokenizer

The default tokenizer splits a text into tokens on space characters
(i.e., ``' '``). In the above example, the text is instead split into tokens on
commas (``','``). This is useful if the input text is in CSV (comma-separated values) form.
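To make the difference concrete, the following stand-alone sketch compares
splitting on spaces with the ``comma_tokenizer`` shown above. It uses only
built-in string methods and does not call LibShortText. The function
``whitespace_tokenizer`` is a hypothetical stand-in for the default behavior
described here, not the library's actual implementation.

```python
def whitespace_tokenizer(text):
    # Hypothetical stand-in for the default behavior: split on single spaces.
    return text.lower().split(' ')


def comma_tokenizer(text):
    # Same function as in the example above: split on commas.
    return text.lower().split(',')


line = "Red Shoes,Size 9,Leather"

space_tokens = whitespace_tokenizer(line)
comma_tokens = comma_tokenizer(line)

print(space_tokens)  # ['red', 'shoes,size', '9,leather']
print(comma_tokens)  # ['red shoes', 'size 9', 'leather']
```

Splitting on spaces breaks the comma-separated fields apart and glues
neighboring fields together, whereas splitting on commas keeps each field as one token.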

The three function attributes of :class:`TextPreprocessor` --- 
:attr:`TextPreprocessor.tokenizer`, :attr:`TextPreprocessor.stemmer`, 
and :attr:`TextPreprocessor.stopword_remover` --- can all be replaced in a similar
way. The following table shows the inputs and outputs expected of these functions.
They are listed in the order in which they are called.

========================================= ================== ===================
Functions                                 Inputs             Outputs
========================================= ================== ===================
:attr:`TextPreprocessor.tokenizer`        A :class:`str` of  A :class:`list` of
                                          text.              tokens.
:attr:`TextPreprocessor.stemmer`          A :class:`list` of A :class:`list` of
                                          tokens.            stemmed tokens.
:attr:`TextPreprocessor.stopword_remover` A :class:`list`    A :class:`list` of
                                          of tokens (stemmed tokens, which is
                                          if needed).        a subset of input
                                                             with stopwords 
                                                             being removed.
========================================= ================== ===================

If both stemming and stop-word removal are applied, LibShortText stems the tokens first 
and then removes stop words. A replacement for :attr:`TextPreprocessor.stopword_remover` 
therefore receives stemmed tokens, so its stop-word list may have to be stemmed as well.
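The effect of this ordering can be illustrated with a small stand-alone
sketch. It does not use LibShortText. The names ``toy_stemmer``,
``RAW_STOPWORDS``, and ``stopword_remover`` are hypothetical, and the stemmer
only strips a trailing ``s`` for the sake of the example. The functions follow
the input and output types listed in the table above.

```python
def toy_stemmer(tokens):
    # Hypothetical stemmer for illustration only: strip one trailing "s".
    return [t[:-1] if len(t) > 1 and t.endswith('s') else t for t in tokens]


# Stop words as a user would normally write them (unstemmed).
RAW_STOPWORDS = ['this', 'is', 'of']

# Stem the stop-word list with the same stemmer so it matches stemmed tokens.
STEMMED_STOPWORDS = set(toy_stemmer(RAW_STOPWORDS))


def stopword_remover(tokens):
    # Input: a list of (already stemmed) tokens. Output: a filtered list.
    return [t for t in tokens if t not in STEMMED_STOPWORDS]


tokens = 'this is a list of cats'.split(' ')

# Same order as described above: stem first, then remove stop words.
stemmed = toy_stemmer(tokens)
kept = stopword_remover(stemmed)

# For comparison: filtering stemmed tokens against the unstemmed list.
naive = [t for t in stemmed if t not in set(RAW_STOPWORDS)]

print(stemmed)  # ['thi', 'i', 'a', 'list', 'of', 'cat']
print(kept)     # ['a', 'list', 'cat']
print(naive)    # ['thi', 'i', 'a', 'list', 'cat']
```

With the unstemmed list, the stemmed forms ``thi`` and ``i`` slip through the
filter. Assuming the attribute names documented above, a function like this
could be assigned to ``text_converter.text_prep.stopword_remover`` in the same
way as the tokenizer.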

.. note::

        :meth:`TextPreprocessor.save` and :meth:`TextPreprocessor.load` do 
        **not** save and load the function variables 
        (:attr:`TextPreprocessor.stopword_remover`, 
        :attr:`TextPreprocessor.stemmer`, and 
        :attr:`TextPreprocessor.tokenizer`). If they were modified, they must
        be set again after loading, as in the following example.

        >>> from libshorttext.converter import Text2svmConverter
        >>>
        >>> def comma_tokenizer(text):
        ...     return text.lower().split(',')
        ...
        >>>
        >>> text_converter = Text2svmConverter()
        >>>
        >>> # set the tokenizer and save the converter
        >>> text_converter.text_prep.tokenizer = comma_tokenizer
        >>> text_converter.save('converter_path')
        >>>
        >>> new_text_converter = Text2svmConverter()
        >>>
        >>> # load the converter and set the tokenizer again
        >>> new_text_converter.load('converter_path')
        >>> new_text_converter.text_prep.tokenizer = comma_tokenizer