lookout.style.typos.utils

Various glue functions to work with the input dataset and the output from FastText.

Module Contents

lookout.style.typos.utils.Columns
lookout.style.typos.utils.Candidate
lookout.style.typos.utils.TEMPLATE_DIR
lookout.style.typos.utils.filter_splits(data:pandas.DataFrame, tokens:Set[str])

Keep only the rows of the dataframe whose split tokens all belong to the given vocabulary.

Parameters:
  • data – Dataframe which contains column Columns.Split.
  • tokens – Set of tokens (reference vocabulary).
Returns:

Filtered dataframe.
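
A minimal usage sketch, under the assumption that Columns.Split holds the space-separated split of each identifier; the toy data and vocabulary are illustrative:

    import pandas as pd

    from lookout.style.typos.utils import Columns, filter_splits

    # Toy data; Columns.Split is assumed to hold space-separated sub-tokens.
    data = pd.DataFrame({Columns.Split: ["read file", "reaad file", "get token"]})
    vocabulary = {"read", "file", "get", "token"}

    filtered = filter_splits(data, vocabulary)
    # The second row should be dropped, since "reaad" is not in the vocabulary.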

lookout.style.typos.utils.print_frequencies(frequencies:Dict[str, int], path:str)

Print frequencies of tokens to a file.

Frequencies info is obtained from the id_stats dataframe.

Parameters:
  • frequencies – Dictionary of token frequencies.
  • path – Path to the .csv file to print the frequencies to.
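
A minimal sketch with toy counts; the file name is arbitrary:

    from lookout.style.typos.utils import print_frequencies

    frequencies = {"token": 42, "split": 7}  # toy counts
    print_frequencies(frequencies, "frequencies.csv")
    # The resulting file can be read back with read_frequencies (below).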

lookout.style.typos.utils.read_frequencies(file:str)

Read token frequencies from the file.

Parameters: file – Path to the .csv file with space-separated word-frequency pairs, one pair per line.
Returns: Dictionary of token frequencies.
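
A short sketch, assuming the file was produced by print_frequencies or otherwise contains one space-separated word-frequency pair per line:

    from lookout.style.typos.utils import read_frequencies

    frequencies = read_frequencies("frequencies.csv")
    # frequencies is a Dict[str, int], e.g. {"token": 42, "split": 7}
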
lookout.style.typos.utils.read_vocabulary(file:str)

Read vocabulary tokens from the text file.

Parameters: file – Path to the .csv file in which the vocabulary of correction candidates is stored; the first space-separated token of every line is added to the vocabulary.
Returns: List of vocabulary tokens.
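
A short sketch; the file name is illustrative, and wrapping the result in a set is just a convenience for functions such as filter_splits:

    from lookout.style.typos.utils import read_vocabulary

    tokens = read_vocabulary("vocabulary.csv")  # list of vocabulary tokens
    vocabulary = set(tokens)                    # handy for filter_splits
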
lookout.style.typos.utils.flatten_df_by_column(data:pandas.DataFrame, column:str, new_column:str, apply_function=lambda x: x)

Flatten the DataFrame by column, putting the extracted elements into new_column. The operation is out-of-place: a new DataFrame is returned.

Parameters:
  • data – DataFrame to flatten.
  • column – Column to expand.
  • new_column – Column to populate with elements from flattened column.
  • apply_function – Function applied to every element of the expanded column to produce the values to put into new_column (identity by default).
Returns:

Flattened DataFrame.
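
An illustrative sketch with hypothetical column names; because "tokens" already holds list values, the default identity apply_function is enough:

    import pandas as pd

    from lookout.style.typos.utils import flatten_df_by_column

    data = pd.DataFrame({"id": [1, 2], "tokens": [["get", "value"], ["set"]]})
    flat = flatten_df_by_column(data, "tokens", "token")
    # Expected: three rows, one per element of "tokens", with the element stored
    # in "token" and the other columns repeated; data itself is left unchanged.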

lookout.style.typos.utils.add_context_info(data:pandas.DataFrame)

Split the context of each identifier into the parts before and after the token, and return a new dataframe with this information.

Parameters: data – DataFrame containing columns Columns.Token and Columns.Split.
Returns: New dataframe with added columns Columns.Before and Columns.After, containing the corresponding parts of the context from column Columns.Split.
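
A sketch, assuming Columns.Split holds the space-separated split that contains the token:

    import pandas as pd

    from lookout.style.typos.utils import Columns, add_context_info

    data = pd.DataFrame({
        Columns.Token: ["file"],
        Columns.Split: ["read big file now"],  # assumed space-separated split
    })
    with_context = add_context_info(data)
    # with_context gains Columns.Before and Columns.After, holding the parts of
    # the split before and after the token.
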
lookout.style.typos.utils.rank_candidates(candidates:pandas.DataFrame, pred_probs:Sequence[float], n_candidates:Optional[int]=None, return_all:bool=True)

Rank the correction candidates for each token based on their correctness probabilities.

Parameters:
  • candidates – DataFrame with columns Columns.Id, Columns.Token and Columns.Candidate, indexed by range(len(pred_probs)).
  • pred_probs – Array of probabilities of correctness of every candidate.
  • n_candidates – Number of most probably correct candidates to return for each typo.
  • return_all – If False, return corrections only for tokens whose first (most probable) candidate differs from the token itself.
Returns:

Dictionary {id: [Candidate, …]}; the candidates are sorted by correct_prob in descending order.
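
A sketch with toy candidates and probabilities; the default RangeIndex of the DataFrame satisfies the indexing requirement:

    import pandas as pd

    from lookout.style.typos.utils import Columns, rank_candidates

    candidates = pd.DataFrame({
        Columns.Id: [0, 0, 1],
        Columns.Token: ["reaad", "reaad", "file"],
        Columns.Candidate: ["read", "bread", "file"],
    })
    pred_probs = [0.9, 0.3, 0.99]  # one correctness probability per candidate row
    suggestions = rank_candidates(candidates, pred_probs, n_candidates=1)
    # Expected shape: {0: [<Candidate for "read">], 1: [<Candidate for "file">]}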

lookout.style.typos.utils.suggestions_to_df(data:pandas.DataFrame, suggestions:Dict[int, List[Candidate]])

Convert suggestions from a dictionary to pandas.DataFrame.

Parameters:
  • data – DataFrame containing column Columns.Token.
  • suggestions – Dictionary of suggestions; keys correspond to data.index.
Returns:

DataFrame with columns Columns.Token, Columns.Suggestions, indexed by data.index.
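
A sketch which assumes, beyond what this page states, that a Candidate can be built from a suggested token and its probability; in practice the suggestions dictionary would normally come from rank_candidates:

    import pandas as pd

    from lookout.style.typos.utils import Candidate, Columns, suggestions_to_df

    data = pd.DataFrame({Columns.Token: ["reaad", "file"]})
    # Assumption: Candidate(token, probability); normally produced by rank_candidates.
    suggestions = {0: [Candidate("read", 0.9)], 1: [Candidate("file", 0.99)]}
    df = suggestions_to_df(data, suggestions)
    # df has columns Columns.Token and Columns.Suggestions and shares data.index.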

lookout.style.typos.utils.suggestions_to_flat_df(data:pandas.DataFrame, suggestions:Dict[int, List[Tuple[str, float]]])

Convert suggestions from a dictionary to pandas.DataFrame, flattened by the suggestions column.

Parameters:
  • data – DataFrame containing column Columns.Token.
  • suggestions – Dictionary of suggestions; keys correspond to data.index.
Returns:

DataFrame with columns Columns.Token, Columns.Candidate, Columns.Probability, indexed by data.index.
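
A sketch that follows the documented Dict[int, List[Tuple[str, float]]] shape for suggestions:

    import pandas as pd

    from lookout.style.typos.utils import Columns, suggestions_to_flat_df

    data = pd.DataFrame({Columns.Token: ["reaad"]})
    suggestions = {0: [("read", 0.9), ("bread", 0.3)]}  # (candidate, probability) pairs
    flat = suggestions_to_flat_df(data, suggestions)
    # One row per pair, with columns Columns.Token, Columns.Candidate and
    # Columns.Probability, indexed like data.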