:mod:`lookout.style.typos.utils`
================================

.. py:module:: lookout.style.typos.utils

.. autoapi-nested-parse::

   Various glue functions to work with the input dataset and the output from FastText.


Module Contents
---------------

.. data:: Columns


.. data:: Candidate


.. data:: TEMPLATE_DIR


.. function:: filter_splits(data:pandas.DataFrame, tokens:Set[str])

   Keep only the rows of a dataframe whose splits' tokens all belong to some vocabulary.

   :param data: Dataframe which contains column Columns.Split.
   :param tokens: Set of tokens (reference vocabulary).
   :return: Filtered dataframe.


.. function:: print_frequencies(frequencies:Dict[str, int], path:str)

   Print frequencies of tokens to a file.

   Frequency information is obtained from the id_stats dataframe.

   :param frequencies: Dictionary of tokens' frequencies.
   :param path: Path to a .csv file to print frequencies to.


.. function:: read_frequencies(file:str)

   Read token frequencies from the file.

   :param file: Path to the .csv file with space-separated word-frequency pairs, one per line.
   :return: Dictionary of tokens' frequencies.


.. function:: read_vocabulary(file:str)

   Read vocabulary tokens from the text file.

   :param file: .csv file in which the vocabulary of correction candidates is stored. The first space-separated token of every line is added to the vocabulary.
   :return: List of tokens of the vocabulary.


.. function:: flatten_df_by_column(data:pandas.DataFrame, column:str, new_column:str, apply_function=lambda x: x)

   Flatten DataFrame by `column` with extracted elements put to `new_column`. Operation runs out-of-place.

   :param data: DataFrame to flatten.
   :param column: Column to expand.
   :param new_column: Column to populate with elements from flattened column.
   :param apply_function: Function used to expand every element of flattened column.
   :return: Flattened DataFrame.


.. function:: add_context_info(data:pandas.DataFrame)

   Split the context of an identifier into the parts before and after it and return a new dataframe with this info.

   :param data: DataFrame containing columns Columns.Token and Columns.Split.
   :return: New dataframe with added columns Columns.Before and Columns.After, containing the corresponding parts of the context from column Columns.Split.


.. function:: rank_candidates(candidates:pandas.DataFrame, pred_probs:Sequence[float], n_candidates:Optional[int]=None, return_all:bool=True)

   Rank candidates for tokens' correction based on the correctness probabilities.

   :param candidates: DataFrame with columns Columns.Id, Columns.Token, Columns.Candidate and indexed by range(len(pred_probs)).
   :param pred_probs: Array of probabilities of correctness of every candidate.
   :param n_candidates: Number of most probably correct candidates to return for each typo.
   :param return_all: False to return corrections only for tokens corrected in the first candidate.
   :return: Dictionary `{id : [Candidate, ...]}`, candidates are sorted by correct_prob in descending order.


.. function:: suggestions_to_df(data:pandas.DataFrame, suggestions:Dict[int, List[Candidate]])

   Convert suggestions from dictionary to pandas.DataFrame.

   :param data: DataFrame containing column Columns.Token.
   :param suggestions: Dictionary of suggestions, keys correspond to data.index.
   :return: DataFrame with columns Columns.Token, Columns.Suggestions, indexed by data.index.


.. function:: suggestions_to_flat_df(data:pandas.DataFrame, suggestions:Dict[int, List[Tuple[str, float]]])

   Convert suggestions from dictionary to pandas.DataFrame, flattened by suggestions column.

   :param data: DataFrame containing column Columns.Token.
   :param suggestions: Dictionary of suggestions, keys correspond to data.index.
   :return: DataFrame with columns Columns.Token, Columns.Candidate, Columns.Probability, indexed by data.index.
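
The ranking described for ``rank_candidates`` can be illustrated with a minimal pure-Python sketch. It is not the module's implementation: ``rank_candidates_sketch`` and ``RankedCandidate`` are hypothetical names, plain tuples of ``(id, token, candidate)`` stand in for the DataFrame rows, and ``return_all=False`` is assumed to mean "keep an id only when its top-ranked candidate differs from the original token".

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class RankedCandidate(NamedTuple):
    """Hypothetical stand-in for the module's Candidate type."""
    token: str
    probability: float


def rank_candidates_sketch(
    rows: Sequence[Tuple[int, str, str]],
    pred_probs: Sequence[float],
    n_candidates: Optional[int] = None,
    return_all: bool = True,
) -> Dict[int, List[RankedCandidate]]:
    """Group candidates by typo id and sort them by probability, descending."""
    grouped = defaultdict(list)
    originals = {}
    for (typo_id, token, candidate), prob in zip(rows, pred_probs):
        grouped[typo_id].append(RankedCandidate(candidate, prob))
        originals[typo_id] = token

    ranked = {}
    for typo_id, candidates in grouped.items():
        candidates.sort(key=lambda c: c.probability, reverse=True)
        # Slicing with None keeps every candidate.
        top = candidates[:n_candidates]
        # Assumption: with return_all=False, ids whose best candidate equals
        # the original token (nothing to correct) are dropped.
        if return_all or (top and top[0].token != originals[typo_id]):
            ranked[typo_id] = top
    return ranked


rows = [
    (0, "tokn", "token"),
    (0, "tokn", "tokn"),
    (1, "value", "value"),
    (1, "value", "valve"),
]
probs = [0.9, 0.1, 0.8, 0.2]
result = rank_candidates_sketch(rows, probs, n_candidates=1, return_all=False)
print(result)
```

Here id ``1`` is dropped because its most probable candidate is the token itself, while id ``0`` keeps only its single best correction.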