:mod:`lookout.style.typos.utils`
================================

.. py:module:: lookout.style.typos.utils

.. autoapi-nested-parse::

   Various glue functions to work with the input dataset and the output from FastText.


Module Contents
---------------


.. data:: Columns
   

.. data:: Candidate
   

.. data:: TEMPLATE_DIR
   

.. function:: filter_splits(data:pandas.DataFrame, tokens:Set[str])

   
   Keep only the rows of a dataframe whose split tokens all belong to the given vocabulary.

   :param data: Dataframe which contains column Columns.Split.
   :param tokens: Set of tokens (reference vocabulary).
   :return: Filtered dataframe.
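
As a rough illustration of the filtering rule above, the sketch below keeps a row only when every token of its split is in the vocabulary. The column name ``"split"`` is a hypothetical stand-in for ``Columns.Split``, and storing a split as a space-separated string is an assumption; this page confirms neither.

```python
import pandas as pd

SPLIT = "split"  # hypothetical stand-in for Columns.Split


def filter_splits_sketch(data: pd.DataFrame, tokens: set) -> pd.DataFrame:
    """Keep rows whose split consists only of vocabulary tokens."""
    # Assumption: each split is a space-separated string of tokens.
    mask = data[SPLIT].apply(lambda split: set(split.split()) <= tokens)
    return data[mask]


frame = pd.DataFrame({SPLIT: ["get value", "get valeu", "value"]})
kept = filter_splits_sketch(frame, {"get", "value"})
```

Here the second row is dropped because ``valeu`` is not in the vocabulary; the surviving rows keep their original index labels.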

   
.. function:: print_frequencies(frequencies:Dict[str, int], path:str)

   
   Print frequencies of tokens to a file.

   The frequency information is obtained from the id_stats dataframe.

   :param frequencies: Dictionary of tokens' frequencies.
   :param path: Path to a .csv file to print frequencies to.

   
.. function:: read_frequencies(file:str)

   
   Read token frequencies from the file.

   :param file: Path to the .csv file with space-separated word-frequency pairs one-per-line.
   :return: Dictionary of token frequencies.
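
The file format described above (one space-separated word-frequency pair per line) can be sketched with the standard library alone. The helper names below are hypothetical and do not call this module. It is also an assumption that ``print_frequencies`` writes the same layout that ``read_frequencies`` expects.

```python
import os
import tempfile


def write_frequencies_sketch(frequencies, path):
    """Write one "token frequency" pair per line."""
    with open(path, "w") as out:
        for token, count in frequencies.items():
            out.write("%s %d\n" % (token, count))


def read_frequencies_sketch(path):
    """Parse "token frequency" lines back into a dictionary."""
    frequencies = {}
    with open(path) as source:
        for line in source:
            if not line.strip():
                continue  # tolerate blank lines
            token, count = line.split()
            frequencies[token] = int(count)
    return frequencies


original = {"value": 120, "get": 75}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frequencies.csv")
    write_frequencies_sketch(original, path)
    restored = read_frequencies_sketch(path)
```

Under these assumptions a write followed by a read returns the original dictionary.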

   
.. function:: read_vocabulary(file:str)

   
   Read vocabulary tokens from the text file.

   :param file: .csv file in which the vocabulary of correction candidates is stored.
                The first space-separated token of every line is added to the vocabulary.
   :return: List of tokens of the vocabulary.
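
The "first token of every line" rule can be sketched as follows. To stay self-contained, this hypothetical helper takes an iterable of lines where the documented function takes a path, and skipping empty lines is an assumption.

```python
import io


def read_vocabulary_sketch(lines):
    """Collect the first space-separated token of every non-empty line."""
    vocabulary = []
    for line in lines:
        parts = line.split()
        if parts:
            vocabulary.append(parts[0])
    return vocabulary


sample = io.StringIO("value 120\nget 75\n\nindex 3\n")
vocabulary = read_vocabulary_sketch(sample)
```

Anything after the first token on a line, such as a frequency, is ignored, so a frequencies file in the format above could plausibly double as a vocabulary file.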

   
.. function:: flatten_df_by_column(data:pandas.DataFrame, column:str, new_column:str, apply_function=lambda x: x)

   
   Flatten a DataFrame by `column`, putting the extracted elements into `new_column`.
   The operation runs out-of-place.

   :param data: DataFrame to flatten.
   :param column: Column to expand.
   :param new_column: Column to populate with elements from flattened column.
   :param apply_function: Function used to expand every element of flattened column.
   :return: Flattened DataFrame.
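
One way to picture the flattening is with ``DataFrame.explode``. This is a sketch of the described behaviour, not the module's implementation. Keeping the original ``column`` next to ``new_column`` is an assumption, and the column names are made up for the example.

```python
import pandas as pd


def flatten_sketch(data, column, new_column, apply_function=lambda x: x):
    """Return a copy with one row per element produced from `column`."""
    flat = data.copy()  # out-of-place: the input frame is left untouched
    flat[new_column] = flat[column].apply(apply_function)
    return flat.explode(new_column)


frame = pd.DataFrame({"split": ["get value", "index"]})
flat = flatten_sketch(frame, "split", "part", str.split)
```

Each produced element gets its own row, and rows that came from the same source row share its index label.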

   
.. function:: add_context_info(data:pandas.DataFrame)

   
   Split the context of each identifier into its before and after parts and return a new dataframe with this information.

   :param data: DataFrame containing columns Columns.Token and Columns.Split.
   :return: New dataframe with added columns Columns.Before and Columns.After,
            containing the corresponding parts of the context from column Columns.Split.
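
For a single row, the before/after split might look like the sketch below. It assumes the split is a space-separated string that contains the token, and that both parts are returned as strings; the real column contents may differ.

```python
def split_context_sketch(token, split):
    """Return the parts of `split` before and after the first `token`."""
    parts = split.split()
    position = parts.index(token)  # raises ValueError if token is absent
    before = " ".join(parts[:position])
    after = " ".join(parts[position + 1:])
    return before, after


before, after = split_context_sketch("value", "get value count")
```

Applying such a helper row by row would yield the values for ``Columns.Before`` and ``Columns.After``.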

   
.. function:: rank_candidates(candidates:pandas.DataFrame, pred_probs:Sequence[float], n_candidates:Optional[int]=None, return_all:bool=True)

   
   Rank candidates for tokens' correction based on the correctness probabilities.

   :param candidates: DataFrame with columns Columns.Id, Columns.Token, Columns.Candidate
                      and indexed by range(len(pred_probs)).
   :param pred_probs: Array of probabilities of correctness of every candidate.
   :param n_candidates: Number of most probably correct candidates to return for each typo.
   :param return_all: False to return corrections only for tokens that are corrected by the
                      first (top-ranked) candidate.
   :return: Dictionary `{id : [Candidate, ...]}`, candidates are sorted
            by correct_prob in a descending order.
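
The ranking logic can be sketched without pandas, using plain ``(id, token, candidate)`` triples in place of the DataFrame. The ``Suggestion`` tuple is a hypothetical stand-in for ``Candidate``. Reading ``return_all=False`` as "skip typos whose best candidate equals the original token" is an interpretation, not confirmed behaviour.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the module's Candidate type.
Suggestion = namedtuple("Suggestion", ["token", "prob"])


def rank_sketch(rows, pred_probs, n_candidates=None, return_all=True):
    """rows: (id, token, candidate) triples aligned with pred_probs."""
    grouped = defaultdict(list)
    originals = {}
    for (typo_id, token, candidate), prob in zip(rows, pred_probs):
        grouped[typo_id].append(Suggestion(candidate, prob))
        originals[typo_id] = token
    ranked = {}
    for typo_id, suggestions in grouped.items():
        suggestions.sort(key=lambda s: s.prob, reverse=True)
        # Assumed reading of return_all=False: skip typos whose best
        # candidate is the original token itself.
        if not return_all and suggestions[0].token == originals[typo_id]:
            continue
        ranked[typo_id] = suggestions[:n_candidates]
    return ranked


rows = [(0, "valeu", "value"), (0, "valeu", "valet"), (0, "valeu", "valeu"),
        (1, "get", "get"), (1, "get", "git")]
probs = [0.9, 0.3, 0.1, 0.8, 0.2]
top_two = rank_sketch(rows, probs, n_candidates=2, return_all=False)
everything = rank_sketch(rows, probs)
```

With ``return_all=False`` only id ``0`` remains, because the best candidate for id ``1`` is the unchanged token ``get``.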

   
.. function:: suggestions_to_df(data:pandas.DataFrame, suggestions:Dict[int, List[Candidate]])

   
   Convert suggestions from dictionary to pandas.DataFrame.

   :param data: DataFrame containing column Columns.Token.
   :param suggestions: Dictionary of suggestions, keys correspond to data.index.
   :return: DataFrame with columns Columns.Token, Columns.Suggestions, indexed by data.index.

   
.. function:: suggestions_to_flat_df(data:pandas.DataFrame, suggestions:Dict[int, List[Tuple[str, float]]])

   
   Convert suggestions from dictionary to pandas.DataFrame, flattened by suggestions column.

   :param data: DataFrame containing column Columns.Token.
   :param suggestions: Dictionary of suggestions, keys correspond to data.index.
   :return: DataFrame with columns Columns.Token, Columns.Candidate, Columns.Probability,
            indexed by data.index.
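
A minimal sketch of the flattened conversion, with one output row per suggested candidate. The column names are hypothetical stand-ins for ``Columns.Token``, ``Columns.Candidate`` and ``Columns.Probability``, and the helper does not call this module.

```python
import pandas as pd

# Hypothetical stand-ins for the Columns.* names.
TOKEN, CANDIDATE, PROBABILITY = "token", "candidate", "probability"


def suggestions_to_flat_sketch(data, suggestions):
    """Build one row per (token, candidate, probability) triple."""
    index, rows = [], []
    for key, candidates in suggestions.items():
        for candidate, probability in candidates:
            index.append(key)
            rows.append((data.loc[key, TOKEN], candidate, probability))
    return pd.DataFrame(rows, index=index,
                        columns=[TOKEN, CANDIDATE, PROBABILITY])


frame = pd.DataFrame({TOKEN: ["valeu", "gte"]}, index=[10, 11])
flat = suggestions_to_flat_sketch(
    frame, {10: [("value", 0.9), ("valet", 0.1)], 11: [("get", 0.7)]})
```

Index labels from ``data`` repeat once per candidate, so the result can be joined back onto the original frame.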