lookout.style.typos.utils
Various glue functions to work with the input dataset and the output from FastText.
Module Contents
- lookout.style.typos.utils.Columns
- lookout.style.typos.utils.Candidate
- lookout.style.typos.utils.TEMPLATE_DIR
- lookout.style.typos.utils.filter_splits(data: pandas.DataFrame, tokens: Set[str])
  Keep only the rows of a dataframe whose split tokens all belong to the given vocabulary.
  Parameters:
    - data – Dataframe which contains column Columns.Split.
    - tokens – Set of tokens (reference vocabulary).
  Returns: Filtered dataframe.
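A minimal sketch of the filtering described above. The column name `token_split` is a stand-in for `Columns.Split`, which is assumed to hold a space-separated string of subtokens.

```python
import pandas as pd

def filter_splits_sketch(data: pd.DataFrame, tokens: set) -> pd.DataFrame:
    """Keep rows whose space-separated split consists only of known tokens."""
    # "token_split" stands in for Columns.Split (an assumption of this sketch).
    mask = data["token_split"].map(lambda s: set(s.split()) <= tokens)
    return data[mask]

df = pd.DataFrame({"token_split": ["read file", "raed file", "write log"]})
vocab = {"read", "file", "write", "log"}
filtered = filter_splits_sketch(df, vocab)  # the row with the typo "raed" is dropped
```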
- lookout.style.typos.utils.print_frequencies(frequencies: Dict[str, int], path: str)
  Print frequencies of tokens to a file.
  Frequencies info is obtained from the id_stats dataframe.
  Parameters:
    - frequencies – Dictionary of tokens' frequencies.
    - path – Path to a .csv file to print frequencies to.
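A sketch of what the writing side might look like, assuming the file format is space-separated "word frequency" pairs, one per line (the format described for read_frequencies below):

```python
import os
import tempfile

def print_frequencies_sketch(frequencies, path):
    # Assumed format: space-separated "word frequency" pairs, one per line.
    with open(path, "w") as out:
        for token, freq in frequencies.items():
            out.write("%s %d\n" % (token, freq))

path = os.path.join(tempfile.mkdtemp(), "frequencies.csv")
print_frequencies_sketch({"read": 10, "file": 7}, path)
```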
- lookout.style.typos.utils.read_frequencies(file: str)
  Read token frequencies from a file.
  Parameters:
    - file – Path to the .csv file with space-separated word-frequency pairs, one pair per line.
  Returns: Dictionary of token frequencies.
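Parsing the stated format is straightforward; a sketch under the assumption that frequencies are integers:

```python
import os
import tempfile

def read_frequencies_sketch(file: str):
    """Parse space-separated "word frequency" lines into a dict (assumed format)."""
    frequencies = {}
    with open(file) as f:
        for line in f:
            token, freq = line.split()
            frequencies[token] = int(freq)
    return frequencies

# Round-trip against a hand-written file in the same format.
path = os.path.join(tempfile.mkdtemp(), "frequencies.csv")
with open(path, "w") as out:
    out.write("read 10\nfile 7\n")
freqs = read_frequencies_sketch(path)
```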
- lookout.style.typos.utils.read_vocabulary(file: str)
  Read vocabulary tokens from a text file.
  Parameters:
    - file – Path to the .csv file storing the vocabulary of correction candidates. The first space-separated token of every line is added to the vocabulary.
  Returns: List of vocabulary tokens.
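Because only the first space-separated token of each line matters, the vocabulary file can share the word-frequency format above. A sketch:

```python
import os
import tempfile

def read_vocabulary_sketch(file: str):
    # Only the first space-separated token of each non-empty line is kept.
    with open(file) as f:
        return [line.split()[0] for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "vocabulary.csv")
with open(path, "w") as out:
    out.write("read 10\nfile 7\n")
vocab = read_vocabulary_sketch(path)  # frequencies are ignored
```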
- lookout.style.typos.utils.flatten_df_by_column(data: pandas.DataFrame, column: str, new_column: str, apply_function=lambda x: x)
  Flatten a DataFrame by column, putting the extracted elements into new_column. The operation is out-of-place: the input dataframe is not modified.
  Parameters:
    - data – DataFrame to flatten.
    - column – Column to expand.
    - new_column – Column to populate with elements from the flattened column.
    - apply_function – Function used to expand every element of the flattened column.
  Returns: Flattened DataFrame.
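The behavior can be illustrated with a simple re-implementation: every row is duplicated once per element that apply_function extracts from its column value. Column names here (`token`, `split`, `subtoken`) are illustrative, not the library's constants.

```python
import pandas as pd

def flatten_df_by_column_sketch(data, column, new_column, apply_function=lambda x: x):
    # Out-of-place: build a fresh frame with one row per extracted element.
    rows = []
    for _, row in data.iterrows():
        for element in apply_function(row[column]):
            new_row = row.to_dict()
            new_row[new_column] = element
            rows.append(new_row)
    return pd.DataFrame(rows)

df = pd.DataFrame({"token": ["readFile"], "split": ["read file"]})
flat = flatten_df_by_column_sketch(df, "split", "subtoken",
                                   apply_function=lambda s: s.split())
# One input row becomes two output rows, one per subtoken.
```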
- lookout.style.typos.utils.add_context_info(data: pandas.DataFrame)
  Split the context of each identifier into before and after parts and return a new dataframe with this info.
  Parameters:
    - data – DataFrame containing columns Columns.Token and Columns.Split.
  Returns: New dataframe with added columns Columns.Before and Columns.After, containing the corresponding parts of the context from column Columns.Split.
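A sketch of the assumed semantics: locate the token inside the space-separated split and record the parts before and after it. The column names `token`, `split`, `before`, `after` stand in for the `Columns.*` constants.

```python
import pandas as pd

def add_context_info_sketch(data):
    # Out-of-place: copy the frame, then compute before/after context per row.
    data = data.copy()
    before, after = [], []
    for _, row in data.iterrows():
        parts = row["split"].split()
        i = parts.index(row["token"])  # first occurrence of the token
        before.append(" ".join(parts[:i]))
        after.append(" ".join(parts[i + 1:]))
    data["before"] = before
    data["after"] = after
    return data

df = pd.DataFrame({"token": ["file"], "split": ["read file now"]})
out = add_context_info_sketch(df)
```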
- lookout.style.typos.utils.rank_candidates(candidates: pandas.DataFrame, pred_probs: Sequence[float], n_candidates: Optional[int] = None, return_all: bool = True)
  Rank correction candidates for tokens based on their correctness probabilities.
  Parameters:
    - candidates – DataFrame with columns Columns.Id, Columns.Token, Columns.Candidate, indexed by range(len(pred_probs)).
    - pred_probs – Array of probabilities of correctness of every candidate.
    - n_candidates – Number of the most probably correct candidates to return for each typo.
    - return_all – False to return corrections only for tokens corrected in the first candidate.
  Returns: Dictionary {id : [Candidate, …]}; candidates are sorted by correctness probability in descending order.
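The core of the ranking can be sketched as a group-and-sort: group candidate rows by typo id, sort each group by its probability, and keep the top n. The `Candidate` namedtuple, the plain column names, and the omission of `return_all` handling are all simplifications of this sketch.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the library's Candidate type.
Candidate = namedtuple("Candidate", ["token", "confidence"])

def rank_candidates_sketch(candidates, pred_probs, n_candidates=None):
    # Attach probabilities positionally, then rank within each typo id.
    suggestions = {}
    frame = candidates.assign(prob=list(pred_probs))
    for typo_id, group in frame.groupby("id"):
        ranked = group.sort_values("prob", ascending=False)
        if n_candidates is not None:
            ranked = ranked.head(n_candidates)
        suggestions[typo_id] = [Candidate(r.candidate, r.prob)
                                for r in ranked.itertuples()]
    return suggestions

df = pd.DataFrame({"id": [0, 0], "token": ["raed", "raed"],
                   "candidate": ["read", "raid"]})
ranked = rank_candidates_sketch(df, [0.9, 0.4])
```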
- lookout.style.typos.utils.suggestions_to_df(data: pandas.DataFrame, suggestions: Dict[int, List[Candidate]])
  Convert suggestions from a dictionary to a pandas.DataFrame.
  Parameters:
    - data – DataFrame containing column Columns.Token.
    - suggestions – Dictionary of suggestions whose keys correspond to data.index.
  Returns: DataFrame with columns Columns.Token and Columns.Suggestions, indexed by data.index.
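A sketch of the conversion, with one row per token and the whole suggestions list stored in a single cell. Plain (token, probability) tuples stand in for Candidate objects, and the column names are illustrative.

```python
import pandas as pd

def suggestions_to_df_sketch(data, suggestions):
    # One row per token; the list of suggestions stays nested in one cell.
    return pd.DataFrame({
        "token": data["token"],
        "suggestions": [suggestions[i] for i in data.index],
    }, index=data.index)

df = pd.DataFrame({"token": ["raed"]}, index=[0])
out = suggestions_to_df_sketch(df, {0: [("read", 0.9), ("raid", 0.4)]})
```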
- lookout.style.typos.utils.suggestions_to_flat_df(data: pandas.DataFrame, suggestions: Dict[int, List[Tuple[str, float]]])
  Convert suggestions from a dictionary to a pandas.DataFrame, flattened by the suggestions column.
  Parameters:
    - data – DataFrame containing column Columns.Token.
    - suggestions – Dictionary of suggestions whose keys correspond to data.index.
  Returns: DataFrame with columns Columns.Token, Columns.Candidate, Columns.Probability, indexed by data.index.
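In contrast to the nested form above, the flat conversion emits one row per (token, candidate) pair. A sketch with illustrative column names:

```python
import pandas as pd

def suggestions_to_flat_df_sketch(data, suggestions):
    # One output row per (token, candidate, probability) triple.
    rows = []
    for i in data.index:
        for candidate, prob in suggestions[i]:
            rows.append({"token": data.loc[i, "token"],
                         "candidate": candidate,
                         "probability": prob})
    return pd.DataFrame(rows)

df = pd.DataFrame({"token": ["raed"]}, index=[0])
flat = suggestions_to_flat_df_sketch(df, {0: [("read", 0.9), ("raid", 0.4)]})
# A token with two suggestions yields two rows.
```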