lookout.style.typos.research.baseline¶
Module Contents¶
-
lookout.style.typos.research.baseline.MAX_DISTANCE= 2¶
-
class lookout.style.typos.research.baseline.Baseline(frequencies_file)¶
Typo correction model based on the SymSpell lookup algorithm (https://github.com/wolfgarbe/SymSpell) and a simple Random Forest classifier that uses token frequencies and the edit distance between the typo and each candidate.
Requires a file containing token frequencies in the format “token, frequency”.
Training data: a dataframe indexed by “id” with the columns “identifier” and “typo”.
Testing data: a dataframe indexed by “id” with the column “typo”.
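The candidate lookup described above can be illustrated with a minimal sketch of SymSpell-style symmetric-delete search. This is not the module's implementation: the helper names (`deletes`, `build_index`, `lookup_candidates`), the in-memory `frequencies` dictionary, and the use of `MAX_DISTANCE` as the deletion and filtering limit are all assumptions made for illustration. The Random Forest step that would rank the candidates is omitted; the sketch only produces the two features the description mentions (token frequency and edit distance).

```python
from collections import defaultdict

# Mirrors the module-level constant; how the module actually uses it is an assumption.
MAX_DISTANCE = 2


def deletes(word, max_distance=MAX_DISTANCE):
    """Return the word plus every string made by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(frequencies):
    """Map every delete-variant of every known token back to the tokens that produce it."""
    index = defaultdict(set)
    for token in frequencies:
        for variant in deletes(token):
            index[variant].add(token)
    return index


def lookup_candidates(typo, index, frequencies):
    """Return (candidate, frequency, edit distance) rows within MAX_DISTANCE of the typo."""
    found = set()
    for variant in deletes(typo):
        found |= index.get(variant, set())
    rows = []
    for candidate in found:
        distance = edit_distance(typo, candidate)
        if distance <= MAX_DISTANCE:
            rows.append((candidate, frequencies[candidate], distance))
    # Closest first, then most frequent, then alphabetical for a stable order.
    return sorted(rows, key=lambda row: (row[2], -row[1], row[0]))


# Hypothetical token frequencies standing in for the "token, frequency" file.
frequencies = {"value": 120, "values": 45, "valve": 3, "index": 80}
index = build_index(frequencies)
candidates = lookup_candidates("vlaue", index, frequencies)
print(candidates)
```

In this toy vocabulary, “values” and “valve” share delete-variants with “vlaue” and are therefore retrieved from the index, but both are three edits away and are dropped by the distance filter, so only “value” remains. A classifier such as a Random Forest could then be trained on rows like these to choose among several surviving candidates.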
-
fit(self, train_file, cand_train_file=None)¶
-
dump(self, dump_file)¶
-
suggest(self, test_file, cand_test_file=None)¶
-
correct(self, test_file, cand_file=None)¶
-
_freq(self, token)¶
-
_lookup_corrections(self, typo_info)¶
-
_create_candidates(self, data, cand_file)¶
-
_create_labels(self)¶
-
_create_matrix(self, candidates)¶
-
-
lookout.style.typos.research.baseline.baseline(args)¶