lookout.style.typos.corrector

Typo correction model.

Module Contents

class lookout.style.typos.corrector.TyposCorrector(ranking_config:Optional[Mapping[str, Any]]=None, **kwargs)

Bases: modelforge.Model

Model for correcting typos in tokens inside identifiers.

_log
NAME = typos_correction
VENDOR = source{d}
DESCRIPTION = Model that suggests fixes to correct typos.
LICENSE
processes_number

Return the number of processes used for multiprocessing during training and prediction.

initialize_generator(self, vocabulary_file:str, frequencies_file:str, embeddings_file:str, config:Optional[Mapping[str, Any]]=None)

Construct a new CandidatesGenerator.

Parameters:
  • vocabulary_file – The path to the vocabulary.
  • frequencies_file – The path to the frequencies.
  • embeddings_file – The path to the embeddings.
  • config – Candidates generation configuration, options:
      ◦ neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
      ◦ edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
      ◦ max_distance: Maximum edit distance for the symspell candidates lookup (int).
      ◦ radius: Maximum edit distance from the typo allowed for candidates (int).
      ◦ max_corrected_length: Maximum length of the prefix in which the symspell lookup for typos is conducted (int).
      ◦ start_pool_size: Length of data starting from which multiprocessing is desired (int).
      ◦ chunksize: Maximum size of a chunk for one process during multiprocessing (int).
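As a sketch only, a generation config covering the options above might look like the following dictionary. The keys mirror the documented options; the values are arbitrary placeholders, not recommended defaults.

```python
# Illustrative candidates-generation config. The keys mirror the options
# documented above; the values are arbitrary examples, not defaults.
generation_config = {
    "neighbors_number": 20,      # embedding neighbors considered as candidates
    "edit_dist_number": 20,      # most frequent tokens per edit distance
    "max_distance": 2,           # symspell lookup edit distance
    "radius": 3,                 # max edit distance from the typo for a candidate
    "max_corrected_length": 12,  # prefix length for the symspell lookup
    "start_pool_size": 64,       # data length at which multiprocessing starts
    "chunksize": 256,            # max chunk per process
}

# All option values are integers, as documented.
assert all(isinstance(v, int) for v in generation_config.values())
```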
set_ranking_config(self, config:Mapping[str, Any])

Update the ranking config; see the XGBoost docs for details.

Parameters:config – Ranking configuration, options:
  • train_rounds: Number of training rounds (int).
  • early_stopping: Early stopping parameter (int).
  • boost_param: Boosting parameters (dict).
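A hedged sketch of a ranking config: train_rounds and early_stopping are the documented integers, while boost_param is a dict passed through to XGBoost. The particular values and XGBoost parameter names below are illustrative, not defaults taken from the library.

```python
# Illustrative ranking config for the underlying XGBoost ranker.
# Values are example placeholders, not tuned defaults.
ranking_config = {
    "train_rounds": 1000,    # number of boosting rounds
    "early_stopping": 100,   # stop if no improvement for this many rounds
    "boost_param": {         # forwarded to XGBoost as booster parameters
        "max_depth": 6,
        "eta": 0.03,
        "subsample": 0.5,
    },
}

assert isinstance(ranking_config["boost_param"], dict)
```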
set_generation_config(self, config:Mapping[str, Any])

Update the candidates generation config.

Parameters:config – Candidates generation configuration, options:
  • neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
  • edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
  • max_distance: Maximum edit distance for the symspell candidates lookup (int).
  • radius: Maximum edit distance from the typo allowed for candidates (int).
  • max_corrected_length: Maximum length of the prefix in which the symspell lookup for typos is conducted (int).
  • start_pool_size: Length of data starting from which multiprocessing is desired (int).
  • chunksize: Maximum size of a chunk for one process during multiprocessing (int).
expand_vocabulary(self, additional_tokens:Iterable[str])

Add given tokens to the model’s vocabulary.

Parameters:additional_tokens – Tokens to add to the vocabulary.
train(self, data:pandas.DataFrame, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None)

Train corrector on tokens from the given dataset.

Parameters:
  • data – DataFrame which contains columns Columns.Token, Columns.CorrectToken, and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file where to save the candidates (.csv.xz).
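A minimal sketch of the expected training frame. The column names below ("token", "correct_token", "token_split") are placeholders: the real names come from the library's Columns enum (Columns.Token, Columns.CorrectToken, Columns.Split), and the rows are invented.

```python
import pandas as pd

# Placeholder column names; in lookout they come from the Columns enum
# (Columns.Token, Columns.CorrectToken, Columns.Split).
TOKEN, CORRECT_TOKEN, SPLIT = "token", "correct_token", "token_split"

train_data = pd.DataFrame({
    TOKEN: ["recieve", "lenght", "color"],          # tokens as written
    CORRECT_TOKEN: ["receive", "length", "color"],  # ground-truth spellings
    SPLIT: ["recieve data", "array lenght", "background color"],  # identifier context
})

# The frame must carry all three columns before being passed to train().
assert {TOKEN, CORRECT_TOKEN, SPLIT} <= set(train_data.columns)
```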
train_on_file(self, data_file:str, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None)

Train corrector on tokens from the given file.

Parameters:
  • data_file – A .csv dump of a dataframe which contains columns Columns.Token, Columns.CorrectToken, and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file where to save the candidates (.csv.xz).
suggest(self, data:pandas.DataFrame, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None, n_candidates:int=3, return_all:bool=True)

Suggest corrections for the tokens from the given dataset.

Parameters:
  • data – DataFrame which contains columns Columns.Token and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file to save candidates to (.csv.xz).
  • n_candidates – Number of most probable candidates to return.
  • return_all – If False, return suggestions only for corrected tokens.
Returns:

Dictionary {id : [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
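To make the return shape concrete, here is a hand-written stand-in for what suggest() hands back: a mapping from row id to (candidate, probability) pairs, most probable first. The tokens and probabilities are invented for illustration.

```python
# Sketch of the return shape of suggest(): row id -> candidates sorted by
# correctness probability, most probable first. Values are invented.
suggestions = {
    0: [("receive", 0.93), ("relieve", 0.04)],
    1: [("length", 0.88), ("lengths", 0.07), ("lenght", 0.02)],
}

# Each candidate list is sorted by probability in descending order.
for candidates in suggestions.values():
    probs = [proba for _, proba in candidates]
    assert probs == sorted(probs, reverse=True)
```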

suggest_on_file(self, data_file:str, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None, n_candidates:int=3, return_all:bool=True)

Suggest corrections for the tokens from the given file.

Parameters:
  • data_file – A .csv dump of a DataFrame which contains columns Columns.Token and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file to save candidates to (.csv.xz).
  • n_candidates – Number of most probable candidates to return.
  • return_all – If False, return suggestions only for corrected tokens.
Returns:

Dictionary {id : [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.

suggest_by_batches(self, data:pandas.DataFrame, n_candidates:int=3, return_all:bool=True, batch_size:int=2048)

Suggest corrections for the tokens from the given dataset in batches. Precalculated candidates are not supported.

Parameters:
  • data – DataFrame which contains columns Columns.Token and Columns.Split.
  • n_candidates – Number of most probable candidates to return.
  • return_all – If False, return suggestions only for corrected tokens.
  • batch_size – Batch size.
Returns:

Dictionary {id : [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
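The batching scheme this method implies can be sketched without the library: split the rows into consecutive chunks of batch_size and merge the per-batch suggestion dicts. This is an assumption about the mechanics, not the library's actual implementation.

```python
# Sketch of batch splitting as suggest_by_batches() implies: consecutive
# chunks of batch_size rows, with a smaller final remainder chunk.
def iter_batches(rows, batch_size=2048):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(5000))  # stand-in for DataFrame rows
batches = list(iter_batches(rows, batch_size=2048))
assert [len(b) for b in batches] == [2048, 2048, 904]
```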

evaluate(self, test_data:pandas.DataFrame)

Evaluate the corrector on the given test dataset.

Save the resulting metrics to the model metadata and print them to the standard output.

Parameters:test_data – DataFrame which contains column Columns.Token; column Columns.Split is optional, but used when present.
Returns:

Suggestions for the correction of tokens inside test_data and the quality report.
__eq__(self, other:'TyposCorrector')
dump(self)

Return the string representation used by Model.__str__ to format the object.

_generate_tree(self)
_load_tree(self, tree:dict)