lookout.style.typos.corrector

Typo correction model.

Module Contents

class lookout.style.typos.corrector.TyposCorrector(ranking_config:Optional[Mapping[str, Any]]=None, **kwargs)

Bases: modelforge.Model

Model for correcting typos in tokens inside identifiers.

_log
NAME = typos_correction
VENDOR = source{d}
DESCRIPTION = Model that suggests fixes to correct typos.
LICENSE
processes_number

Return the number of processes used for multiprocessing during training and prediction.

initialize_generator(self, vocabulary_file:str, frequencies_file:str, embeddings_file:str, config:Optional[Mapping[str, Any]]=None)

Construct a new CandidatesGenerator.

Parameters:
  • vocabulary_file – The path to the vocabulary.
  • frequencies_file – The path to the frequencies.
  • embeddings_file – The path to the embeddings.
  • config – Candidates generation configuration, options:
      ◦ neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
      ◦ edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
      ◦ max_distance: Maximum edit distance for the symspell candidates lookup (int).
      ◦ radius: Maximum edit distance from the typo allowed for candidates (int).
      ◦ max_corrected_length: Maximum length of the prefix in which the symspell lookup for typos is conducted (int).
      ◦ start_pool_size: Length of data starting from which multiprocessing is desired (int).
      ◦ chunksize: Maximum size of a chunk for one process during multiprocessing (int).
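As a sketch only, a generation config covering the options above might look like the following dictionary. The keys mirror the documented options; the values are arbitrary placeholders, not recommended defaults.

```python
# Illustrative candidates-generation config. The keys mirror the options
# documented above; the values are arbitrary examples, not defaults.
generation_config = {
    "neighbors_number": 20,      # embedding neighbors considered as candidates
    "edit_dist_number": 20,      # most frequent tokens per edit distance
    "max_distance": 2,           # symspell lookup edit distance
    "radius": 3,                 # max edit distance from the typo for a candidate
    "max_corrected_length": 12,  # prefix length for the symspell lookup
    "start_pool_size": 64,       # data length at which multiprocessing starts
    "chunksize": 256,            # max chunk per process
}

# All option values are integers, as documented.
assert all(isinstance(v, int) for v in generation_config.values())
```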
set_ranking_config(self, config:Mapping[str, Any])

Update the ranking config; see the XGBoost docs for details.

Parameters:config – Ranking configuration, options:
  • train_rounds: Number of training rounds (int).
  • early_stopping: Early stopping parameter (int).
  • boost_param: Boosting parameters (dict).
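A hedged sketch of a ranking config: train_rounds and early_stopping are the documented integers, while boost_param is a dict passed through to XGBoost. The particular values and XGBoost parameter names below are illustrative, not defaults taken from the library.

```python
# Illustrative ranking config for the underlying XGBoost ranker.
# Values are example placeholders, not tuned defaults.
ranking_config = {
    "train_rounds": 1000,    # number of boosting rounds
    "early_stopping": 100,   # stop if no improvement for this many rounds
    "boost_param": {         # forwarded to XGBoost as booster parameters
        "max_depth": 6,
        "eta": 0.03,
        "subsample": 0.5,
    },
}

assert isinstance(ranking_config["boost_param"], dict)
```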
set_generation_config(self, config:Mapping[str, Any])

Update the candidates generation config.

Parameters:config – Candidates generation configuration, options:
  • neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
  • edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
  • max_distance: Maximum edit distance for the symspell candidates lookup (int).
  • radius: Maximum edit distance from the typo allowed for candidates (int).
  • max_corrected_length: Maximum length of the prefix in which the symspell lookup for typos is conducted (int).
  • start_pool_size: Length of data starting from which multiprocessing is desired (int).
  • chunksize: Maximum size of a chunk for one process during multiprocessing (int).
expand_vocabulary(self, additional_tokens:Iterable[str])

Add given tokens to the model’s vocabulary.

Parameters:additional_tokens – Tokens to add to the vocabulary.
train(self, data:pandas.DataFrame, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None)

Train corrector on tokens from the given dataset.

Parameters:
  • data – DataFrame which contains columns Columns.Token, Columns.CorrectToken, and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file where to save the candidates (.csv.xz).
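A minimal sketch of the expected training frame. The column names below ("token", "correct_token", "token_split") are placeholders: the real names come from the library's Columns enum (Columns.Token, Columns.CorrectToken, Columns.Split), and the rows are invented.

```python
import pandas as pd

# Placeholder column names; in lookout they come from the Columns enum
# (Columns.Token, Columns.CorrectToken, Columns.Split).
TOKEN, CORRECT_TOKEN, SPLIT = "token", "correct_token", "token_split"

train_data = pd.DataFrame({
    TOKEN: ["recieve", "lenght", "color"],          # tokens as written
    CORRECT_TOKEN: ["receive", "length", "color"],  # ground-truth spellings
    SPLIT: ["recieve data", "array lenght", "background color"],  # identifier context
})

# The frame must carry all three columns before being passed to train().
assert {TOKEN, CORRECT_TOKEN, SPLIT} <= set(train_data.columns)
```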
train_on_file(self, data_file:str, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None)

Train corrector on tokens from the given file.

Parameters:
  • data_file – A .csv dump of a dataframe which contains columns Columns.Token, Columns.CorrectToken, and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file where to save the candidates (.csv.xz).
suggest(self, data:pandas.DataFrame, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None, n_candidates:int=3, return_all:bool=True)

Suggest corrections for the tokens from the given dataset.

Parameters:
  • data – DataFrame which contains columns Columns.Token and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file to save candidates to (.csv.xz).
  • n_candidates – Number of most probable candidates to return.
  • return_all – If False, return suggestions only for corrected tokens.
Returns:

Dictionary {id : [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
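To make the return shape concrete, here is a hand-written stand-in for what suggest() hands back: a mapping from row id to (candidate, probability) pairs, most probable first. The tokens and probabilities are invented for illustration.

```python
# Sketch of the return shape of suggest(): row id -> candidates sorted by
# correctness probability, most probable first. Values are invented.
suggestions = {
    0: [("receive", 0.93), ("relieve", 0.04)],
    1: [("length", 0.88), ("lengths", 0.07), ("lenght", 0.02)],
}

# Each candidate list is sorted by probability in descending order.
for candidates in suggestions.values():
    probs = [proba for _, proba in candidates]
    assert probs == sorted(probs, reverse=True)
```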

suggest_on_file(self, data_file:str, candidates:Optional[str]=None, save_candidates_file:Optional[str]=None, n_candidates:int=3, return_all:bool=True)

Suggest corrections for the tokens from the given file.

Parameters:
  • data_file – A .csv dump of a DataFrame which contains columns Columns.Token and Columns.Split.
  • candidates – A .csv.xz dump of a dataframe with precalculated candidates.
  • save_candidates_file – Path to file to save candidates to (.csv.xz).
  • n_candidates – Number of most probable candidates to return.
  • return_all – If False, return suggestions only for corrected tokens.
Returns:

Dictionary {id : [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.

suggest_by_batches(self, data:pandas.DataFrame, n_candidates:int=3, return_all:bool=True, batch_size:int=2048)

Suggest corrections for the tokens from the given dataset in batches. Precalculated candidates are not supported.

Parameters:
  • data – DataFrame which contains columns Columns.Token and Columns.Split.
  • n_candidates – Number of most probable candidates to return.
  • return_all – If False, return suggestions only for corrected tokens.
  • batch_size – Batch size.
Returns:

Dictionary {id : [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
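The batching scheme this method implies can be sketched without the library: split the rows into consecutive chunks of batch_size and merge the per-batch suggestion dicts. This is an assumption about the mechanics, not the library's actual implementation.

```python
# Sketch of batch splitting as suggest_by_batches() implies: consecutive
# chunks of batch_size rows, with a smaller final remainder chunk.
def iter_batches(rows, batch_size=2048):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = list(range(5000))  # stand-in for DataFrame rows
batches = list(iter_batches(rows, batch_size=2048))
assert [len(b) for b in batches] == [2048, 2048, 904]
```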

evaluate(self, test_data:pandas.DataFrame)

Evaluate the corrector on the given test dataset.

Save the resulting metrics to the model metadata and print them to the standard output.

Parameters:test_data – DataFrame which contains column Columns.Token; column Columns.Split is optional, but used when present.
Returns:

Suggestions for the correction of tokens inside test_data and the quality report.
__eq__(self, other:'TyposCorrector')
dump(self)

Return the string representation used by Model.__str__ to format the object.

_generate_tree(self)
_load_tree(self, tree:dict)