lookout.style.typos.generation

Generation of the typo correction candidates. Contains feature extraction and serialization.

Module Contents

lookout.style.typos.generation.TypoInfo
lookout.style.typos.generation.Features
class lookout.style.typos.generation.CandidatesGenerator(**kwargs)

Bases: modelforge.Model

Looks for correction candidates for typos and generates features for them. Candidates are generated in three ways:
  1. Closest to the given token by cosine distance between embeddings.
  2. Closest by cosine distance to the compound vector of the token context.
  3. Closest by edit distance and most frequent tokens from the vocabulary.
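The first of these lookups can be sketched as a nearest-neighbor search by cosine distance over token embeddings. This is a minimal illustration with a toy two-dimensional vocabulary, not the library's actual implementation (which uses FastText vectors and the configured `neighbors_number`):

```python
import numpy as np

def closest_by_cosine(token_vec: np.ndarray, vocab_vecs: dict, k: int) -> list:
    """Return the k vocabulary tokens whose embeddings are closest
    to token_vec by cosine distance (hypothetical helper)."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(vocab_vecs, key=lambda t: cos_dist(token_vec, vocab_vecs[t]))[:k]

vocab = {
    "color": np.array([1.0, 0.0]),
    "colour": np.array([0.9, 0.1]),
    "banana": np.array([0.0, 1.0]),
}
typo_vec = np.array([1.0, 0.05])
print(closest_by_cosine(typo_vec, vocab, 2))  # → ['color', 'colour']
```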

NAME = candidates_generator
VENDOR = source{d}
DESCRIPTION = Model that generates candidates to fix typos.
LICENSE
NO_COMPRESSION = ['/wv/vectors/']
construct(self, vocabulary_file:str, frequencies_file:str, embeddings_file:str, config:Optional[Mapping[str, Any]]=None)

Construct correction candidates generator.

Parameters:
  • vocabulary_file – Text file used to generate the vocabulary of correction candidates. The first token of every line is added to the vocabulary.
  • frequencies_file – Path to the text file with frequencies. Each line must contain two values separated by whitespace: “token count”.
  • embeddings_file – Path to the dump of FastText model.
  • config – Candidates generation configuration, options:
      neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
      edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
      max_distance: Maximum edit distance for the symspell candidates lookup (int).
      radius: Maximum edit distance from the typo allowed for candidates (int).
      max_corrected_length: Maximum length of the prefix in which the symspell lookup for typos is conducted (int).
      start_pool_size: Length of data starting from which multiprocessing is used (int).
      chunksize: Maximum size of a chunk for one process during multiprocessing (int).
      set_min_freq: If True, the frequency of unknown tokens is set to the minimum frequency in the vocabulary; otherwise it is set to zero.
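A configuration mapping covering these options might look as follows. The keys mirror the list above; the values are placeholders for illustration, not the library's defaults:

```python
# Hypothetical configuration mapping for CandidatesGenerator.construct;
# values are illustrative placeholders, not the library's defaults.
config = {
    "neighbors_number": 4,        # candidates from embedding neighbors
    "edit_dist_number": 4,        # most frequent tokens per edit distance
    "max_distance": 2,            # symspell lookup edit distance limit
    "radius": 3,                  # max edit distance allowed for candidates
    "max_corrected_length": 12,   # prefix length for the symspell lookup
    "start_pool_size": 64,        # data length that triggers multiprocessing
    "chunksize": 256,             # chunk size per worker process
    "set_min_freq": False,        # unknown tokens get zero frequency
}
print(sorted(config))
```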
set_config(self, config:Optional[Mapping[str, Any]]=None)

Update candidates generation config.

Parameters: config – Candidates generation configuration, options:
      neighbors_number: Number of neighbors of the context and typo embeddings to consider as candidates (int).
      edit_dist_number: Number of the most frequent tokens among tokens at an equal edit distance from the typo to consider as candidates (int).
      max_distance: Maximum edit distance for the symspell candidates lookup (int).
      radius: Maximum edit distance from the typo allowed for candidates (int).
      max_corrected_length: Maximum length of the prefix in which the symspell lookup for typos is conducted (int).
      start_pool_size: Length of data starting from which multiprocessing is used (int).
      chunksize: Maximum size of a chunk for one process during multiprocessing (int).
expand_vocabulary(self, additional_tokens:Iterable[str])

Add given tokens to the generator’s vocabulary.

Parameters:additional_tokens – Tokens to add to the vocabulary.
generate_candidates(self, data:pandas.DataFrame, processes_number:int, save_candidates_file:Optional[str]=None)

Generate candidates for typos inside data.

Parameters:
  • data – DataFrame which contains the column Columns.Token.
  • processes_number – Number of processes for multiprocessing.
  • save_candidates_file – File to save candidates to.
Returns:

DataFrame containing candidates for corrections and features for their ranking for each typo.
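Preparing the input table can be sketched as below. The literal column name "token" is a stand-in for the library's `Columns.Token` identifier, and the `generate_candidates` call is shown commented out because it requires a constructed generator:

```python
import pandas as pd

# "token" here is a hypothetical stand-in for Columns.Token,
# which in real code is imported from the library.
data = pd.DataFrame({"token": ["exapmle", "functoin", "return"]})
# candidates = generator.generate_candidates(data, processes_number=2,
#                                            save_candidates_file=None)
print(data.shape)  # → (3, 1)
```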

dump(self)

Return a string representation of the candidates generator.

__eq__(self, other:'CandidatesGenerator')
_lookup_corrections_for_token(self, typo_info:TypoInfo)
_get_candidate_tokens(self, typo_info:TypoInfo)
_generate_features(self, typo_info:TypoInfo, dist:int, typo_vec:numpy.ndarray, candidate:str, candidate_vec:numpy.ndarray)

Compile features for a single correction candidate.

Parameters:
  • typo_info – instance of TypoInfo class.
  • dist – edit distance from candidate to typo.
  • typo_vec – embedding of the original token.
  • candidate – candidate token.
  • candidate_vec – embedding of the candidate token.
Returns:

Index, the typo and candidate tokens, frequency information, cosine distances between the embeddings and contexts, edit distance between the tokens, and the embeddings of the tokens and contexts.
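Two of these features, edit distance and cosine distance, can be computed as follows. This is an illustrative sketch with toy vectors, not the library's feature pipeline:

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (illustrative only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cos_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

typo_vec = np.array([1.0, 0.0])       # toy embedding of the typo
cand_vec = np.array([0.8, 0.6])       # toy embedding of a candidate
features = (levenshtein("exapmle", "example"), round(cos_dist(typo_vec, cand_vec), 3))
print(features)  # → (2, 0.2)
```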

_vec(self, token:str)
_freq(self, token:str)
static _cos(first_vec:numpy.ndarray, second_vec:numpy.ndarray)
_min_cos(self, typo_vec:numpy.ndarray, context:str)
_avg_cos(self, typo_vec:numpy.ndarray, context:str)
_closest(self, item:Union[numpy.ndarray, str], quantity:int)
_freq_relation(self, first_token:str, second_token:str)
_compound_vec(self, text:str)
_generate_tree(self)
_load_tree(self, tree:dict)
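A compound context vector of the kind `_compound_vec` produces can be sketched as an average of the embeddings of the tokens in a context string. This is a toy stand-in, assuming a plain dict of vectors rather than the FastText model the library actually uses:

```python
import numpy as np

# Illustrative stand-in for _compound_vec: average the embeddings of
# the known tokens in a context string.
def compound_vec(text: str, vectors: dict) -> np.ndarray:
    vecs = [vectors[tok] for tok in text.split() if tok in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

vectors = {"open": np.array([1.0, 0.0]), "file": np.array([0.0, 1.0])}
print(compound_vec("open the file", vectors))  # → [0.5 0.5]
```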
lookout.style.typos.generation.get_candidates_features(candidates:pandas.DataFrame)

Take the feature vectors belonging to the typo correction candidates from the table.

lookout.style.typos.generation.get_candidates_metadata(candidates:pandas.DataFrame)

Take the information about the typo correction candidates from the table.
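The split these two helpers perform can be illustrated with a hypothetical candidates table, where the token columns play the role of metadata and the numeric columns stand in for the ranking features; the real helpers select the library's own column sets:

```python
import pandas as pd

# Hypothetical candidates table: "typo"/"candidate" act as metadata,
# "edit_dist"/"cos_dist" stand in for the ranking features.
candidates = pd.DataFrame({
    "typo": ["exapmle", "exapmle"],
    "candidate": ["example", "examples"],
    "edit_dist": [2, 3],
    "cos_dist": [0.1, 0.3],
})
features = candidates[["edit_dist", "cos_dist"]].to_numpy()
metadata = candidates[["typo", "candidate"]]
print(features.shape)  # → (2, 2)
```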