lookout.style.typos.corrector

Typo correction model.

Module Contents

- class lookout.style.typos.corrector.TyposCorrector(ranking_config: Optional[Mapping[str, Any]] = None, **kwargs)

  Bases: modelforge.Model

  Model for correcting typos in tokens inside identifiers.
  - _log

  - NAME = "typos_correction"

  - VENDOR = "source{d}"

  - DESCRIPTION = "Model that suggests fixes to correct typos."

  - LICENSE
  - processes_number

    Return the number of processes used for multiprocessing during training and prediction.
  - initialize_generator(self, vocabulary_file: str, frequencies_file: str, embeddings_file: str, config: Optional[Mapping[str, Any]] = None)

    Construct a new CandidatesGenerator.

    Parameters:
      - vocabulary_file – Path to the vocabulary file.
      - frequencies_file – Path to the token frequencies file.
      - embeddings_file – Path to the embeddings file.
      - config – Candidates generation configuration; options:
        - neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
        - edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
        - max_distance: Maximum edit distance for the symspell candidate lookup (int).
        - radius: Maximum edit distance from the typo allowed for candidates (int).
        - max_corrected_length: Maximum length of the prefix in which the symspell typo lookup is conducted (int).
        - start_pool_size: Data length starting from which multiprocessing is used (int).
        - chunksize: Maximum size of a chunk for one process during multiprocessing (int).
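The generation options above form a plain mapping. A minimal sketch of such a config follows; every value is an illustrative assumption, not a documented default:

```python
# Illustrative candidates-generation config; the values are assumptions
# chosen for the sketch, not the library's defaults.
generation_config = {
    "neighbors_number": 20,       # embedding neighbors considered as candidates
    "edit_dist_number": 20,       # most frequent tokens per edit distance
    "max_distance": 2,            # symspell lookup edit distance
    "radius": 3,                  # max edit distance allowed for candidates
    "max_corrected_length": 12,   # prefix length for the symspell lookup
    "start_pool_size": 64,        # data length above which multiprocessing is used
    "chunksize": 256,             # chunk size per worker process
}

# corrector.initialize_generator("vocab.csv", "frequencies.csv",
#                                "embeddings.vec", config=generation_config)
```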
  - set_ranking_config(self, config: Mapping[str, Any])

    Update the ranking config; see the XGBoost docs for details.

    Parameters:
      - config – Ranking configuration; options:
        - train_rounds: Number of training rounds (int).
        - early_stopping: Early stopping parameter (int).
        - boost_param: Boosting parameters (dict).
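A hedged sketch of a ranking config: the boost_param keys follow standard XGBoost parameter names, but the concrete values here are assumptions, not the model's defaults.

```python
# Illustrative ranking config for set_ranking_config(); values are assumptions.
ranking_config = {
    "train_rounds": 1000,      # XGBoost boosting rounds
    "early_stopping": 200,     # stop if the eval metric stops improving
    "boost_param": {           # standard XGBoost booster parameters
        "max_depth": 6,
        "eta": 0.03,
        "subsample": 0.5,
        "objective": "binary:logistic",
    },
}

# corrector.set_ranking_config(ranking_config)
```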
  - set_generation_config(self, config: Mapping[str, Any])

    Update the candidates generation config.

    Parameters:
      - config – Candidates generation configuration; options:
        - neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
        - edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
        - max_distance: Maximum edit distance for the symspell candidate lookup (int).
        - radius: Maximum edit distance from the typo allowed for candidates (int).
        - max_corrected_length: Maximum length of the prefix in which the symspell typo lookup is conducted (int).
        - start_pool_size: Data length starting from which multiprocessing is used (int).
        - chunksize: Maximum size of a chunk for one process during multiprocessing (int).
  - expand_vocabulary(self, additional_tokens: Iterable[str])

    Add the given tokens to the model's vocabulary.

    Parameters:
      - additional_tokens – Tokens to add to the vocabulary.
  - train(self, data: pandas.DataFrame, candidates: Optional[str] = None, save_candidates_file: Optional[str] = None)

    Train the corrector on tokens from the given dataset.

    Parameters:
      - data – DataFrame containing the columns Columns.Token, Columns.CorrectToken, and Columns.Split.
      - candidates – A .csv.xz dump of a DataFrame with precalculated candidates.
      - save_candidates_file – Path to the file where the candidates are saved (.csv.xz).
  - train_on_file(self, data_file: str, candidates: Optional[str] = None, save_candidates_file: Optional[str] = None)

    Train the corrector on tokens from the given file.

    Parameters:
      - data_file – A .csv dump of a DataFrame containing the columns Columns.Token, Columns.CorrectToken, and Columns.Split.
      - candidates – A .csv.xz dump of a DataFrame with precalculated candidates.
      - save_candidates_file – Path to the file where the candidates are saved (.csv.xz).
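As a sketch, a training dump for train_on_file() could be produced like this. Note that the literal CSV column names are defined by the library's Columns constants; the lowercase headers and example rows below are assumptions for illustration only.

```python
import csv

# Hypothetical training dump for train_on_file(). The headers stand in for
# Columns.Token, Columns.CorrectToken, and Columns.Split; the real names
# come from the library and may differ.
rows = [
    {"token": "recieve", "correct_token": "receive", "token_split": "self recieve data"},
    {"token": "lenght",  "correct_token": "length",  "token_split": "lenght of buffer"},
]

with open("typos_train.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["token", "correct_token", "token_split"])
    writer.writeheader()
    writer.writerows(rows)

# corrector.train_on_file("typos_train.csv")
```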
  - suggest(self, data: pandas.DataFrame, candidates: Optional[str] = None, save_candidates_file: Optional[str] = None, n_candidates: int = 3, return_all: bool = True)

    Suggest corrections for the tokens from the given dataset.

    Parameters:
      - data – DataFrame containing the columns Columns.Token and Columns.Split.
      - candidates – A .csv.xz dump of a DataFrame with precalculated candidates.
      - save_candidates_file – Path to the file to save the candidates to (.csv.xz).
      - n_candidates – Number of the most probable candidates to return.
      - return_all – False to return suggestions only for corrected tokens.

    Returns: Dictionary {id: [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
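The returned mapping can be consumed as shown in this sketch; the suggestion data below is fabricated for illustration, and best_fix is a hypothetical helper, not part of the library:

```python
# Sketch of consuming the mapping returned by suggest(): token id ->
# candidates sorted by correctness probability, descending.
suggestions = {
    0: [("receive", 0.93), ("relieve", 0.04)],
    1: [("length", 0.88), ("lent", 0.07), ("eight", 0.02)],
}

def best_fix(suggestions, token_id, threshold=0.5):
    """Return the top candidate if it is probable enough, else None."""
    candidates = suggestions.get(token_id, [])
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][0]
    return None

print(best_fix(suggestions, 0))  # prints: receive
```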
  - suggest_on_file(self, data_file: str, candidates: Optional[str] = None, save_candidates_file: Optional[str] = None, n_candidates: int = 3, return_all: bool = True)

    Suggest corrections for the tokens from the given file.

    Parameters:
      - data_file – A .csv dump of a DataFrame containing the columns Columns.Token and Columns.Split.
      - candidates – A .csv.xz dump of a DataFrame with precalculated candidates.
      - save_candidates_file – Path to the file to save the candidates to (.csv.xz).
      - n_candidates – Number of the most probable candidates to return.
      - return_all – False to return suggestions only for corrected tokens.

    Returns: Dictionary {id: [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
  - suggest_by_batches(self, data: pandas.DataFrame, n_candidates: int = 3, return_all: bool = True, batch_size: int = 2048)

    Suggest corrections for the tokens from the given dataset, processed in batches. Does not support precalculated candidates.

    Parameters:
      - data – DataFrame containing the columns Columns.Token and Columns.Split.
      - n_candidates – Number of the most probable candidates to return.
      - return_all – False to return suggestions only for corrected tokens.
      - batch_size – Batch size.

    Returns: Dictionary {id: [(candidate, correctness_proba), …]}; candidates are sorted by correctness probability in descending order.
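The batching behavior can be sketched as follows: slice the data into fixed-size chunks, run suggestion per chunk, and merge the per-chunk dictionaries. fake_suggest stands in for TyposCorrector.suggest and is an assumption for the sketch:

```python
# Minimal sketch of the batching that suggest_by_batches() performs.
def fake_suggest(batch_ids):
    """Stand-in for TyposCorrector.suggest on one batch (hypothetical)."""
    return {i: [("fix", 1.0)] for i in batch_ids}

def suggest_in_batches(ids, batch_size=2048):
    merged = {}
    for start in range(0, len(ids), batch_size):
        # Each chunk is suggested independently, then merged by token id.
        merged.update(fake_suggest(ids[start:start + batch_size]))
    return merged

result = suggest_in_batches(list(range(5000)), batch_size=2048)
```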
  - evaluate(self, test_data: pandas.DataFrame)

    Evaluate the corrector on the given test dataset.

    Save the resulting metrics to the model metadata and print them to standard output.

    Parameters:
      - test_data – DataFrame containing the column Columns.Token; the column Columns.Split is optional but used when present.

    Returns: Suggestions for correcting the tokens inside test_data and the quality report.
  - __eq__(self, other: 'TyposCorrector')
  - dump(self)

    Return a string used by Model.__str__ to format the object.
  - _generate_tree(self)

  - _load_tree(self, tree: dict)