:mod:`lookout.style.typos.generation`
=====================================

.. py:module:: lookout.style.typos.generation

.. autoapi-nested-parse::

   Generation of typo correction candidates. Contains feature extraction and serialization.


Module Contents
---------------


.. data:: TypoInfo
   

.. data:: Features
   

.. py:class:: CandidatesGenerator(**kwargs)

   Bases: :class:`modelforge.Model`

   
   Looks for candidates for correction of typos and generates features for them.
   Candidates are generated in three ways:

   1. Closest by cosine distance of embeddings to the given token.
   2. Closest by cosine distance to the compound vector of token context.
   3. Closest by the edit distance and most frequent tokens from vocabulary.

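   The third strategy can be illustrated with a minimal, self-contained sketch. It is
   not this class's implementation (the configuration options mention a symspell lookup,
   which is not reproduced here). The ``levenshtein`` and ``edit_distance_candidates``
   helpers, the toy frequencies, and the parameter values are all assumptions made for
   illustration; only the ``radius`` and ``edit_dist_number`` names come from the
   documented options.

   .. code-block:: python

      from collections import defaultdict


      def levenshtein(a: str, b: str) -> int:
          """Plain dynamic-programming edit distance (insert, delete, substitute)."""
          previous = list(range(len(b) + 1))
          for i, char_a in enumerate(a, start=1):
              current = [i]
              for j, char_b in enumerate(b, start=1):
                  current.append(min(
                      previous[j] + 1,                       # deletion
                      current[j - 1] + 1,                    # insertion
                      previous[j - 1] + (char_a != char_b),  # substitution
                  ))
              previous = current
          return previous[-1]


      def edit_distance_candidates(typo, frequencies, radius, edit_dist_number):
          """Keep the most frequent tokens at each edit distance up to ``radius``."""
          by_distance = defaultdict(list)
          for token in frequencies:
              dist = levenshtein(typo, token)
              if 0 < dist <= radius:
                  by_distance[dist].append(token)
          candidates = []
          for dist in sorted(by_distance):
              # Most frequent tokens first within one edit distance.
              ranked = sorted(by_distance[dist], key=lambda t: -frequencies[t])
              candidates.extend((token, dist) for token in ranked[:edit_dist_number])
          return candidates


      # Hypothetical toy vocabulary with token frequencies.
      toy_frequencies = {"value": 900, "values": 400, "valve": 30, "vague": 10, "index": 700}
      found = edit_distance_candidates("valeu", toy_frequencies, radius=2, edit_dist_number=2)

   In this toy example three tokens lie at edit distance 2 from ``valeu``, and only the
   two most frequent ones, ``value`` and ``values``, are kept.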

   .. attribute:: NAME
      :annotation: = candidates_generator 

      
   .. attribute:: VENDOR
      :annotation: = source{d} 

      
   .. attribute:: DESCRIPTION
      :annotation: = Model that generates candidates to fix typos. 

      
   .. attribute:: LICENSE
      

   .. attribute:: NO_COMPRESSION
      :annotation: = ['/wv/vectors/'] 

      
   .. method:: construct(self, vocabulary_file:str, frequencies_file:str, embeddings_file:str, config:Optional[Mapping[str, Any]]=None)

      
      Construct correction candidates generator.

      :param vocabulary_file: Text file used to generate the vocabulary of correction
                              candidates. The first token of every line split is added
                              to the vocabulary.
      :param frequencies_file: Path to the text file with frequencies. Each line must
                               contain two values separated by whitespace: "token count".
      :param embeddings_file: Path to the dump of the FastText model.
      :param config: Candidates generation configuration, options:

                     - ``neighbors_number``: Number of neighbors of context and typo
                       embeddings to consider as candidates (int).
                     - ``edit_dist_number``: Number of the most frequent tokens among tokens
                       at equal edit distance from the typo to consider as candidates (int).
                     - ``max_distance``: Maximum edit distance for symspell lookup for
                       candidates (int).
                     - ``radius``: Maximum edit distance from typo allowed for candidates
                       (int).
                     - ``max_corrected_length``: Maximum length of prefix in which symspell
                       lookup for typos is conducted (int).
                     - ``start_pool_size``: Length of data, starting from which
                       multiprocessing is desired (int).
                     - ``chunksize``: Max size of a chunk for one process during
                       multiprocessing (int).
                     - ``set_min_freq``: True to set the frequency of the unknown tokens to
                       the minimum frequency in the vocabulary. It is set to zero otherwise.

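      The two text inputs can be pictured with a short sketch. It only illustrates the
      documented line formats and is not the code used by ``construct``: the
      ``read_vocabulary`` and ``read_frequencies`` helpers are hypothetical, lines are
      assumed to be split on whitespace, and counts are assumed to be integers.

      .. code-block:: python

         def read_vocabulary(lines):
             """Collect the first whitespace-separated token of every non-empty line."""
             vocabulary = set()
             for line in lines:
                 parts = line.split()
                 if parts:
                     vocabulary.add(parts[0])
             return vocabulary


         def read_frequencies(lines):
             """Parse "token count" lines into a token -> count mapping."""
             frequencies = {}
             for line in lines:
                 if not line.strip():
                     continue  # Skip blank lines.
                 token, count = line.split()
                 frequencies[token] = int(count)
             return frequencies


         # Hypothetical file contents, given as lists of lines instead of real files.
         vocabulary = read_vocabulary(["value 900", "index 700", ""])
         frequencies = read_frequencies(["value 900", "index 700"])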
      
   .. method:: set_config(self, config:Optional[Mapping[str, Any]]=None)

      
      Update candidates generation config.

      :param config: Candidates generation configuration, options:

                     - ``neighbors_number``: Number of neighbors of context and typo
                       embeddings to consider as candidates (int).
                     - ``edit_dist_number``: Number of the most frequent tokens among tokens
                       at equal edit distance from the typo to consider as candidates (int).
                     - ``max_distance``: Maximum edit distance for symspell lookup for
                       candidates (int).
                     - ``radius``: Maximum edit distance from typo allowed for candidates
                       (int).
                     - ``max_corrected_length``: Maximum length of prefix in which symspell
                       lookup for typos is conducted (int).
                     - ``start_pool_size``: Length of data, starting from which
                       multiprocessing is desired (int).
                     - ``chunksize``: Max size of a chunk for one process during
                       multiprocessing (int).

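      One way to picture a partial configuration update is the sketch below. It is an
      assumption about the general pattern, not the behavior of ``set_config`` itself:
      ``DEFAULT_CONFIG``, its values, and ``merge_config`` are hypothetical, and rejecting
      unknown keys is a design choice of this sketch only.

      .. code-block:: python

         # Hypothetical default values; the real defaults are not given in this document.
         DEFAULT_CONFIG = {
             "neighbors_number": 10,
             "edit_dist_number": 10,
             "max_distance": 2,
             "radius": 3,
             "max_corrected_length": 12,
             "start_pool_size": 64,
             "chunksize": 256,
         }


         def merge_config(current, update=None):
             """Return a copy of ``current`` with known options overridden by ``update``."""
             merged = dict(current)
             for key, value in (update or {}).items():
                 if key not in merged:
                     raise KeyError("Unknown candidates generation option: %s" % key)
                 merged[key] = value
             return merged


         # Only the listed options change; the rest keep their previous values.
         config = merge_config(DEFAULT_CONFIG, {"radius": 2, "neighbors_number": 5})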
      
   .. method:: expand_vocabulary(self, additional_tokens:Iterable[str])

      
      Add given tokens to the generator's vocabulary.

      :param additional_tokens: Tokens to add to the vocabulary.

      
   .. method:: generate_candidates(self, data:pandas.DataFrame, processes_number:int, save_candidates_file:Optional[str]=None)

      
      Generate candidates for typos inside data.

      :param data: DataFrame which contains column Columns.Token.
      :param processes_number: Number of processes for multiprocessing.
      :param save_candidates_file: File to save candidates to.
      :return: DataFrame containing candidates for corrections
               and features for their ranking for each typo.

      
   .. method:: dump(self)

      
      Represent the candidates generator.

      
   .. method:: __eq__(self, other:'CandidatesGenerator')

      
   .. method:: _lookup_corrections_for_token(self, typo_info:TypoInfo)

      
   .. method:: _get_candidate_tokens(self, typo_info:TypoInfo)

      
   .. method:: _generate_features(self, typo_info:TypoInfo, dist:int, typo_vec:numpy.ndarray, candidate:str, candidate_vec:numpy.ndarray)

      
      Compile features for a single correction candidate.

      :param typo_info: instance of TypoInfo class.
      :param dist: edit distance from candidate to typo.
      :param typo_vec: embedding of the original token.
      :param candidate: candidate token.
      :param candidate_vec: embedding of the candidate token.
      :return: index, typo and candidate tokens, frequencies info,
               cosine distances between embeddings and contexts,
               edit distance between the tokens,
               embeddings of the tokens and contexts.

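      The cosine distance features can be illustrated with a minimal sketch. The
      ``cosine_distance`` helper is hypothetical and written in pure Python for
      self-containment, whereas this class works with ``numpy.ndarray`` vectors. Returning
      1.0 for a zero-norm vector is an assumption of this sketch, not documented behavior.

      .. code-block:: python

         import math


         def cosine_distance(first_vec, second_vec):
             """Return 1 - cosine similarity; 1.0 if either vector has zero norm."""
             dot = sum(x * y for x, y in zip(first_vec, second_vec))
             first_norm = math.sqrt(sum(x * x for x in first_vec))
             second_norm = math.sqrt(sum(y * y for y in second_vec))
             if first_norm == 0 or second_norm == 0:
                 return 1.0  # Assumed convention for undefined similarity.
             return 1.0 - dot / (first_norm * second_norm)

      Identical directions give a distance of 0, orthogonal vectors give 1, and opposite
      directions give 2.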
      
   .. method:: _vec(self, token:str)

      
   .. method:: _freq(self, token:str)

      
   .. staticmethod:: _cos(first_vec:numpy.ndarray, second_vec:numpy.ndarray)

      
   .. method:: _min_cos(self, typo_vec:numpy.ndarray, context:str)

      
   .. method:: _avg_cos(self, typo_vec:numpy.ndarray, context:str)

      
   .. method:: _closest(self, item:Union[numpy.ndarray, str], quantity:int)

      
   .. method:: _freq_relation(self, first_token:str, second_token:str)

      
   .. method:: _compound_vec(self, text:str)

      
   .. method:: _generate_tree(self)

      
   .. method:: _load_tree(self, tree:dict)

      
.. function:: get_candidates_features(candidates:pandas.DataFrame)

   
   Take the feature vectors belonging to the typo correction candidates from the table.

   
.. function:: get_candidates_metadata(candidates:pandas.DataFrame)

   
   Take the information about the typo correction candidates from the table.