:mod:`lookout.style.typos.generation`
=====================================

.. py:module:: lookout.style.typos.generation

.. autoapi-nested-parse::

   Generation of the typo correction candidates. Contains feature extraction and serialization.


Module Contents
---------------

.. data:: TypoInfo

.. data:: Features

.. py:class:: CandidatesGenerator(**kwargs)

   Bases: :class:`modelforge.Model`

   Looks for candidates for correction of typos and generates features for them.

   Candidates are generated in three ways:

   1. Closest by cosine distance of embeddings to the given token.
   2. Closest by cosine distance to the compound vector of token context.
   3. Closest by the edit distance and most frequent tokens from vocabulary.

   .. attribute:: NAME
      :annotation: = candidates_generator

   .. attribute:: VENDOR
      :annotation: = source{d}

   .. attribute:: DESCRIPTION
      :annotation: = Model that generates candidates to fix typos.

   .. attribute:: LICENSE

   .. attribute:: NO_COMPRESSION
      :annotation: = ['/wv/vectors/']

   .. method:: construct(self, vocabulary_file:str, frequencies_file:str, embeddings_file:str, config:Optional[Mapping[str, Any]]=None)

      Construct correction candidates generator.

      :param vocabulary_file: Text file used to generate the vocabulary of correction candidates. The first token of every line split is added to the vocabulary.
      :param frequencies_file: Path to the text file with frequencies. Each line must contain two values separated by whitespace: "token count".
      :param embeddings_file: Path to the dump of the FastText model.
      :param config: Candidates generation configuration, options:

                     - ``neighbors_number``: Number of neighbors of context and typo embeddings to consider as candidates (int).
                     - ``edit_dist_number``: Number of the most frequent tokens among tokens at equal edit distance from the typo to consider as candidates (int).
                     - ``max_distance``: Maximum edit distance for symspell lookup for candidates (int).
                     - ``radius``: Maximum edit distance from typo allowed for candidates (int).
                     - ``max_corrected_length``: Maximum length of prefix in which symspell lookup for typos is conducted (int).
                     - ``start_pool_size``: Length of data, starting from which multiprocessing is desired (int).
                     - ``chunksize``: Max size of a chunk for one process during multiprocessing (int).
                     - ``set_min_freq``: True to set the frequency of the unknown tokens to the minimum frequency in the vocabulary. It is set to zero otherwise.

   .. method:: set_config(self, config:Optional[Mapping[str, Any]]=None)

      Update candidates generation config.

      :param config: Candidates generation configuration, options:

                     - ``neighbors_number``: Number of neighbors of context and typo embeddings to consider as candidates (int).
                     - ``edit_dist_number``: Number of the most frequent tokens among tokens at equal edit distance from the typo to consider as candidates (int).
                     - ``max_distance``: Maximum edit distance for symspell lookup for candidates (int).
                     - ``radius``: Maximum edit distance from typo allowed for candidates (int).
                     - ``max_corrected_length``: Maximum length of prefix in which symspell lookup for typos is conducted (int).
                     - ``start_pool_size``: Length of data, starting from which multiprocessing is desired (int).
                     - ``chunksize``: Max size of a chunk for one process during multiprocessing (int).

   .. method:: expand_vocabulary(self, additional_tokens:Iterable[str])

      Add given tokens to the generator's vocabulary.

      :param additional_tokens: Tokens to add to the vocabulary.

   .. method:: generate_candidates(self, data:pandas.DataFrame, processes_number:int, save_candidates_file:Optional[str]=None)

      Generate candidates for typos inside data.

      :param data: DataFrame which contains column Columns.Token.
      :param processes_number: Number of processes for multiprocessing.
      :param save_candidates_file: File to save candidates to.
      :return: DataFrame containing candidates for corrections and features for their ranking for each typo.

   .. method:: dump(self)

      Represent the candidates generator.

   .. method:: __eq__(self, other:'CandidatesGenerator')

   .. method:: _lookup_corrections_for_token(self, typo_info:TypoInfo)

   .. method:: _get_candidate_tokens(self, typo_info:TypoInfo)

   .. method:: _generate_features(self, typo_info:TypoInfo, dist:int, typo_vec:numpy.ndarray, candidate:str, candidate_vec:numpy.ndarray)

      Compile features for a single correction candidate.

      :param typo_info: Instance of the TypoInfo class.
      :param dist: Edit distance from candidate to typo.
      :param typo_vec: Embedding of the original token.
      :param candidate: Candidate token.
      :param candidate_vec: Embedding of the candidate token.
      :return: Index, typo and candidate tokens, frequencies info, cosine distances between embeddings and contexts, edit distance between the tokens, embeddings of the tokens and contexts.

   .. method:: _vec(self, token:str)

   .. method:: _freq(self, token:str)

   .. staticmethod:: _cos(first_vec:numpy.ndarray, second_vec:numpy.ndarray)

   .. method:: _min_cos(self, typo_vec:numpy.ndarray, context:str)

   .. method:: _avg_cos(self, typo_vec:numpy.ndarray, context:str)

   .. method:: _closest(self, item:Union[numpy.ndarray, str], quantity:int)

   .. method:: _freq_relation(self, first_token:str, second_token:str)

   .. method:: _compound_vec(self, text:str)

   .. method:: _generate_tree(self)

   .. method:: _load_tree(self, tree:dict)

.. function:: get_candidates_features(candidates:pandas.DataFrame)

   Take the feature vectors belonging to the typo correction candidates from the table.

.. function:: get_candidates_metadata(candidates:pandas.DataFrame)

   Take the information about the typo correction candidates from the table.
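
The ``frequencies_file`` format ("token count" per line) and the ``set_min_freq`` option described for ``construct`` can be illustrated with a small standalone sketch. It does not use this module: ``read_frequencies`` and ``token_frequency`` are hypothetical helpers, and skipping malformed lines is an assumption, not documented behavior.

```python
import io
from typing import Dict, Iterable


def read_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Parse lines of the form "token count" into a token -> count mapping."""
    frequencies: Dict[str, int] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: blank or malformed lines are skipped
        token, count = parts
        frequencies[token] = int(count)
    return frequencies


def token_frequency(token: str, frequencies: Dict[str, int], set_min_freq: bool) -> int:
    """Return the known frequency, or a fallback for tokens outside the vocabulary."""
    if token in frequencies:
        return frequencies[token]
    # set_min_freq=True: unknown tokens get the minimum vocabulary frequency.
    # Otherwise their frequency is zero.
    return min(frequencies.values()) if set_min_freq and frequencies else 0


# Hypothetical file content, kept in memory for the example.
sample = io.StringIO("value 120\nindex 75\ncount 3\n")
freqs = read_frequencies(sample)
print(token_frequency("valeu", freqs, set_min_freq=True))
print(token_frequency("valeu", freqs, set_min_freq=False))
```

With ``set_min_freq`` enabled, an unseen token is treated as being as rare as the rarest known token, so it is never given a frequency of exactly zero.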
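
The first two candidate sources listed for ``CandidatesGenerator`` rely on cosine distance between embeddings. The sketch below shows the idea in pure Python on toy two-dimensional vectors. Everything in it is illustrative: the vectors are made up, the helper names are hypothetical, and building the compound context vector as a plain sum of token vectors is an assumption about how ``_compound_vec`` might work, not a statement about the actual implementation.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def compound_vec(tokens: Sequence[str], embeddings: Dict[str, Vector], size: int) -> Vector:
    """Assumed compound vector: element-wise sum of the context token vectors."""
    total = [0.0] * size
    for token in tokens:
        for i, value in enumerate(embeddings.get(token, [0.0] * size)):
            total[i] += value
    return total


def closest(query: Sequence[float], embeddings: Dict[str, Vector], quantity: int) -> List[str]:
    """Return the `quantity` vocabulary tokens nearest to `query` by cosine distance."""
    ranked = sorted(embeddings, key=lambda t: cosine_distance(query, embeddings[t]))
    return ranked[:quantity]


# Toy embeddings, invented for the example.
toy = {
    "get": [1.0, 0.0],
    "set": [0.9, 0.1],
    "value": [0.0, 1.0],
    "index": [0.1, 0.9],
}

by_token = closest(toy["get"], toy, 2)  # neighbors of the token itself
by_context = closest(compound_vec(["value", "index"], toy, 2), toy, 2)  # neighbors of the context
print(by_token, by_context)
```

Here the ``quantity`` argument plays the role that ``neighbors_number`` plays in the configuration: it caps how many nearest neighbors are kept from each embedding-based source.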
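
The third candidate source combines the ``radius`` and ``edit_dist_number`` options: among vocabulary tokens at the same edit distance from the typo, only the most frequent ones are kept. The sketch below illustrates that selection with a brute-force Levenshtein distance over a toy vocabulary. This is a deliberate simplification: the documented options refer to a symspell lookup, which this example does not implement, and ``edit_distance`` and ``edit_candidates`` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def edit_candidates(typo: str, frequencies: Dict[str, int],
                    radius: int, edit_dist_number: int) -> Dict[int, List[str]]:
    """Group tokens within `radius` by distance, keeping the most frequent per group."""
    by_distance: Dict[int, List[str]] = defaultdict(list)
    for token in frequencies:
        dist = edit_distance(typo, token)
        if dist <= radius:
            by_distance[dist].append(token)
    result: Dict[int, List[str]] = {}
    for dist, tokens in sorted(by_distance.items()):
        tokens.sort(key=lambda t: frequencies[t], reverse=True)
        result[dist] = tokens[:edit_dist_number]
    return result


# Toy vocabulary with invented counts.
vocab = {"value": 120, "valve": 8, "values": 40, "vague": 5, "index": 75}
picked = edit_candidates("valeu", vocab, radius=2, edit_dist_number=1)
print(picked)
```

Scanning the whole vocabulary like this is quadratic per comparison and linear in vocabulary size, which is why a dedicated lookup structure is preferable for real vocabularies.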