lookout.style.typos.research.baseline

Module Contents

lookout.style.typos.research.baseline.MAX_DISTANCE = 2
class lookout.style.typos.research.baseline.Baseline(frequencies_file)

Typo correction model based on the SymSpell lookup algorithm

https://github.com/wolfgarbe/SymSpell

and a simple Random Forest classifier that uses token frequencies and the edit distance between the typo and each candidate.
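The candidate lookup step can be illustrated with a minimal sketch of the symmetric-delete idea behind SymSpell. This is not the code of `Baseline`: the helper names `deletes`, `build_index`, and `lookup` are hypothetical, and the sketch only assumes that `MAX_DISTANCE = 2` bounds the number of deleted characters.

```python
from itertools import combinations

MAX_DISTANCE = 2  # assumed to play the same role as the module-level constant


def deletes(token, max_distance=MAX_DISTANCE):
    """Return the token plus every string made by deleting up to max_distance characters."""
    variants = {token}
    for k in range(1, min(max_distance, len(token)) + 1):
        for positions in combinations(range(len(token)), k):
            skip = set(positions)
            variants.add("".join(ch for i, ch in enumerate(token) if i not in skip))
    return variants


def build_index(vocabulary, max_distance=MAX_DISTANCE):
    """Map every delete-variant to the vocabulary tokens that produce it (hypothetical helper)."""
    index = {}
    for token in vocabulary:
        for variant in deletes(token, max_distance):
            index.setdefault(variant, set()).add(token)
    return index


def lookup(typo, index, max_distance=MAX_DISTANCE):
    """Collect vocabulary tokens that share at least one delete-variant with the typo."""
    candidates = set()
    for variant in deletes(typo, max_distance):
        candidates |= index.get(variant, set())
    return candidates


index = build_index(["token", "taken", "frequency"])
candidates = lookup("tokn", index)
```

Sharing a delete-variant is only a cheap filter: the resulting set may contain tokens farther away than the distance bound, so a full implementation would still verify each candidate with an exact edit-distance computation before ranking.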

Requires a file containing token frequencies in the format “token, frequency”.
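As a rough illustration of that input, the sketch below parses a small in-memory sample into a dictionary. The sample contents, the comma-plus-space delimiter, the absence of a header row, and the helper name `read_frequencies` are all assumptions; the actual loader used by `Baseline` is not shown in this page.

```python
import csv
import io

# Hypothetical sample in the "token, frequency" layout (one pair per line, no header).
sample = io.StringIO("get, 120\nset, 95\nvalue, 40\n")


def read_frequencies(handle):
    """Parse 'token, frequency' rows into a dict (hypothetical helper, not the library's loader)."""
    frequencies = {}
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        token, count = row
        frequencies[token] = int(count)
    return frequencies


frequencies = read_frequencies(sample)

# Unknown tokens fall back to a frequency of 0 in this sketch.
unknown_frequency = frequencies.get("unknown", 0)
```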

Training data: a dataframe indexed by “id” with columns “identifier” and “typo”. Testing data: a dataframe indexed by “id” with a “typo” column.

fit(self, train_file, cand_train_file=None)
dump(self, dump_file)
suggest(self, test_file, cand_test_file=None)
correct(self, test_file, cand_file=None)
_freq(self, token)
_lookup_corrections(self, typo_info)
_create_candidates(self, data, cand_file)
_create_labels(self)
_create_matrix(self, candidates)
lookout.style.typos.research.baseline.baseline(args)