lookout.style.typos.utils

Various glue functions to work with the input dataset and the output from FastText.

Module Contents

lookout.style.typos.utils.Columns
lookout.style.typos.utils.Candidate
lookout.style.typos.utils.TEMPLATE_DIR
lookout.style.typos.utils.filter_splits(data:pandas.DataFrame, tokens:Set[str])

Keep only the rows of the dataframe whose split tokens all belong to the given vocabulary.

Parameters:
  • data – Dataframe which contains column Columns.Split.
  • tokens – Set of tokens (reference vocabulary).
Returns:

Filtered dataframe.
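
A minimal usage sketch, under the assumption that Columns.Split holds the space-separated split of each identifier; the toy data and vocabulary are illustrative:

    import pandas as pd

    from lookout.style.typos.utils import Columns, filter_splits

    # Toy data; Columns.Split is assumed to hold space-separated sub-tokens.
    data = pd.DataFrame({Columns.Split: ["read file", "reaad file", "get token"]})
    vocabulary = {"read", "file", "get", "token"}

    filtered = filter_splits(data, vocabulary)
    # The second row should be dropped, since "reaad" is not in the vocabulary.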

lookout.style.typos.utils.print_frequencies(frequencies:Dict[str, int], path:str)

Print frequencies of tokens to a file.

Frequencies info is obtained from the id_stats dataframe.

Parameters:
  • frequencies – Dictionary of token frequencies.
  • path – Path to the .csv file to print the frequencies to.
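
A minimal sketch with toy counts; the file name is arbitrary:

    from lookout.style.typos.utils import print_frequencies

    frequencies = {"token": 42, "split": 7}  # toy counts
    print_frequencies(frequencies, "frequencies.csv")
    # The resulting file can be read back with read_frequencies (below).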

lookout.style.typos.utils.read_frequencies(file:str)

Read token frequencies from the file.

Parameters: file – Path to the .csv file with space-separated word-frequency pairs, one pair per line.
Returns: Dictionary of token frequencies.
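
A short sketch, assuming the file was produced by print_frequencies or otherwise contains one space-separated word-frequency pair per line:

    from lookout.style.typos.utils import read_frequencies

    frequencies = read_frequencies("frequencies.csv")
    # frequencies is a Dict[str, int], e.g. {"token": 42, "split": 7}
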
lookout.style.typos.utils.read_vocabulary(file:str)

Read vocabulary tokens from the text file.

Parameters: file – Path to the .csv file in which the vocabulary of correction candidates is stored; the first space-separated token of every line is added to the vocabulary.
Returns: List of vocabulary tokens.
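
A short sketch; the file name is illustrative, and wrapping the result in a set is just a convenience for functions such as filter_splits:

    from lookout.style.typos.utils import read_vocabulary

    tokens = read_vocabulary("vocabulary.csv")  # list of vocabulary tokens
    vocabulary = set(tokens)                    # handy for filter_splits
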
lookout.style.typos.utils.flatten_df_by_column(data:pandas.DataFrame, column:str, new_column:str, apply_function=lambda x: x)

Flatten the DataFrame by column, putting the extracted elements into new_column. The operation is out-of-place: a new DataFrame is returned.

Parameters:
  • data – DataFrame to flatten.
  • column – Column to expand.
  • new_column – Column to populate with elements from flattened column.
  • apply_function – Function applied to every element of the expanded column to produce the values to put into new_column (identity by default).
Returns:

Flattened DataFrame.
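
An illustrative sketch with hypothetical column names; because "tokens" already holds list values, the default identity apply_function is enough:

    import pandas as pd

    from lookout.style.typos.utils import flatten_df_by_column

    data = pd.DataFrame({"id": [1, 2], "tokens": [["get", "value"], ["set"]]})
    flat = flatten_df_by_column(data, "tokens", "token")
    # Expected: three rows, one per element of "tokens", with the element stored
    # in "token" and the other columns repeated; data itself is left unchanged.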

lookout.style.typos.utils.add_context_info(data:pandas.DataFrame)

Split the context of each identifier into the parts before and after the token, and return a new dataframe with this information.

Parameters: data – DataFrame containing columns Columns.Token and Columns.Split.
Returns: New dataframe with added columns Columns.Before and Columns.After, containing the corresponding parts of the context from column Columns.Split.
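
A sketch, assuming Columns.Split holds the space-separated split that contains the token:

    import pandas as pd

    from lookout.style.typos.utils import Columns, add_context_info

    data = pd.DataFrame({
        Columns.Token: ["file"],
        Columns.Split: ["read big file now"],  # assumed space-separated split
    })
    with_context = add_context_info(data)
    # with_context gains Columns.Before and Columns.After, holding the parts of
    # the split before and after the token.
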
lookout.style.typos.utils.rank_candidates(candidates:pandas.DataFrame, pred_probs:Sequence[float], n_candidates:Optional[int]=None, return_all:bool=True)

Rank the correction candidates for each token based on their correctness probabilities.

Parameters:
  • candidates – DataFrame with columns Columns.Id, Columns.Token and Columns.Candidate, indexed by range(len(pred_probs)).
  • pred_probs – Array of probabilities of correctness of every candidate.
  • n_candidates – Number of most probably correct candidates to return for each typo.
  • return_all – If False, return corrections only for tokens whose first (most probable) candidate differs from the token itself.
Returns:

Dictionary {id: [Candidate, …]}; the candidates are sorted by correct_prob in descending order.
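
A sketch with toy candidates and probabilities; the default RangeIndex of the DataFrame satisfies the indexing requirement:

    import pandas as pd

    from lookout.style.typos.utils import Columns, rank_candidates

    candidates = pd.DataFrame({
        Columns.Id: [0, 0, 1],
        Columns.Token: ["reaad", "reaad", "file"],
        Columns.Candidate: ["read", "bread", "file"],
    })
    pred_probs = [0.9, 0.3, 0.99]  # one correctness probability per candidate row
    suggestions = rank_candidates(candidates, pred_probs, n_candidates=1)
    # Expected shape: {0: [<Candidate for "read">], 1: [<Candidate for "file">]}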

lookout.style.typos.utils.suggestions_to_df(data:pandas.DataFrame, suggestions:Dict[int, List[Candidate]])

Convert suggestions from a dictionary to pandas.DataFrame.

Parameters:
  • data – DataFrame containing column Columns.Token.
  • suggestions – Dictionary of suggestions; keys correspond to data.index.
Returns:

DataFrame with columns Columns.Token, Columns.Suggestions, indexed by data.index.
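
A sketch which assumes, beyond what this page states, that a Candidate can be built from a suggested token and its probability; in practice the suggestions dictionary would normally come from rank_candidates:

    import pandas as pd

    from lookout.style.typos.utils import Candidate, Columns, suggestions_to_df

    data = pd.DataFrame({Columns.Token: ["reaad", "file"]})
    # Assumption: Candidate(token, probability); normally produced by rank_candidates.
    suggestions = {0: [Candidate("read", 0.9)], 1: [Candidate("file", 0.99)]}
    df = suggestions_to_df(data, suggestions)
    # df has columns Columns.Token and Columns.Suggestions and shares data.index.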

lookout.style.typos.utils.suggestions_to_flat_df(data:pandas.DataFrame, suggestions:Dict[int, List[Tuple[str, float]]])

Convert suggestions from a dictionary to pandas.DataFrame, flattened by the suggestions column.

Parameters:
  • data – DataFrame containing column Columns.Token.
  • suggestions – Dictionary of suggestions; keys correspond to data.index.
Returns:

DataFrame with columns Columns.Token, Columns.Candidate, Columns.Probability, indexed by data.index.
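
A sketch that follows the documented Dict[int, List[Tuple[str, float]]] shape for suggestions:

    import pandas as pd

    from lookout.style.typos.utils import Columns, suggestions_to_flat_df

    data = pd.DataFrame({Columns.Token: ["reaad"]})
    suggestions = {0: [("read", 0.9), ("bread", 0.3)]}  # (candidate, probability) pairs
    flat = suggestions_to_flat_df(data, suggestions)
    # One row per pair, with columns Columns.Token, Columns.Candidate and
    # Columns.Probability, indexed like data.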