lookout.style.typos.symspell
Module Contents
-
class lookout.style.typos.symspell.SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)
SymSpell: 1 million times faster through the Symmetric Delete spelling correction algorithm.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and is language independent. Unlike other algorithms, it requires only deletes; no transposes, replaces, or inserts are needed. Transposes, replaces, and inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: for example, Chinese has 70,000 Unicode Han characters.
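A minimal sketch of the idea, not the class's actual implementation: deletes are precomputed for every dictionary term, and at lookup time only deletes of the input are generated and matched against that index. The helper names (deletes, build_index, candidates) and the sample words are hypothetical.

```python
from itertools import combinations


def deletes(word, max_distance):
    """Return word plus every string made by deleting up to max_distance characters."""
    results = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            results.add("".join(c for i, c in enumerate(word) if i not in positions))
    return results


def build_index(dictionary, max_distance):
    """Map every delete variant to the dictionary terms it was derived from."""
    index = {}
    for term in dictionary:
        for variant in deletes(term, max_distance):
            index.setdefault(variant, set()).add(term)
    return index


def candidates(phrase, index, max_distance):
    """Collect dictionary terms that share at least one delete variant with phrase."""
    found = set()
    for variant in deletes(phrase, max_distance):
        found |= index.get(variant, set())
    return found


index = build_index(["hello", "help", "world"], max_distance=2)
# "hlelo" is a transposition of "hello", yet only deletes were generated.
print(sorted(candidates("hlelo", index, 2)))
```

Matching on shared deletes only yields candidates; a real lookup would still verify each candidate's Damerau-Levenshtein distance before returning it.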
SymSpell supports compound splitting / decompounding of multi-word input strings, covering three cases:
1. A space mistakenly inserted into a correct word produced two incorrect terms.
2. A space mistakenly omitted between two correct words produced one incorrect combined term.
3. Multiple independent input terms, with or without spelling errors.
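The spacing cases can be illustrated with a small sketch. It only repairs spaces and assumes each part is otherwise spelled correctly, so it is far simpler than what lookup_compound has to do; the function name split_or_merge and the sample vocabulary are hypothetical.

```python
def split_or_merge(tokens, known):
    """Repair inserted or omitted spaces using a set of known words."""
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in known:
            result.append(token)
            i += 1
            continue
        # Case 1: a space was inserted into a correct word ("spel ling").
        if i + 1 < len(tokens) and token + tokens[i + 1] in known:
            result.append(token + tokens[i + 1])
            i += 2
            continue
        # Case 2: a space was omitted between two correct words ("thequick").
        for cut in range(1, len(token)):
            if token[:cut] in known and token[cut:] in known:
                result.extend([token[:cut], token[cut:]])
                break
        else:
            # Case 3 would apply ordinary spelling correction here;
            # this sketch simply keeps the term unchanged.
            result.append(token)
        i += 1
    return result


known = {"the", "quick", "spelling", "checker"}
print(split_or_merge("thequick spel ling checker".split(), known))
```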
See https://github.com/wolfgarbe/SymSpell for details.
- Args:
- max_dictionary_edit_distance (int, optional): Maximum edit distance used to generate the index. Also acts as an upper bound for the max_edit_distance parameter of lookup(). Defaults to 2.
- prefix_length (int, optional): Prefix length. Should not normally be changed. Defaults to 7.
- count_threshold (int, optional): Corpus-count threshold for a word to be considered correct. Defaults to 1; values below zero are also mapped to 1. Consider setting a higher value if your corpus contains mistakes.
-
_create_dictionary_entry(self, key, count)
Creates or updates a dictionary entry.
- Args:
- key (str): Word to insert or update.
- count (int): Count to save or to add to the existing count.
- Returns:
- bool: True if the word was added to the dictionary, False if it was updated or ignored.
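The return value can be read as follows, sketched here under the assumption that counts for words still below count_threshold are accumulated separately until they reach it. The standalone function and its two dict arguments are hypothetical stand-ins for the instance state.

```python
def create_dictionary_entry(words, below_threshold, key, count, count_threshold=1):
    """Return True only when key becomes a dictionary word for the first time."""
    if key in words:
        # Existing word: just update its count.
        words[key] += count
        return False
    # Add to whatever was accumulated while the word was below the threshold.
    total = below_threshold.pop(key, 0) + count
    if total < count_threshold:
        # Still too rare: remember the count, but do not treat it as a word.
        below_threshold[key] = total
        return False
    words[key] = total
    return True


words, pending = {}, {}
print(create_dictionary_entry(words, pending, "hello", 2, count_threshold=3))
print(create_dictionary_entry(words, pending, "hello", 1, count_threshold=3))
```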
-
load_dictionary(self, corpus)
Loads a dictionary from the file given by corpus.
The file should contain space-separated word-count pairs, one per line.
- Args:
- corpus (str): Path to .csv corpus file.
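A short sketch of the expected file layout and how such lines could be parsed. The function read_word_counts, the sample words, and their counts are made up for illustration; skipping malformed lines is an assumption of the sketch, not documented behavior.

```python
def read_word_counts(lines):
    """Parse space-separated "word count" pairs, one pair per line."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: blank or malformed lines are skipped
        word, count = parts
        counts[word] = counts.get(word, 0) + int(count)
    return counts


# Any iterable of lines works, including an open file object.
sample = ["hello 120", "world 95", "", "hello 5"]
print(read_word_counts(sample))
```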
-
create_dictionary(self, corpus)
Creates a dictionary from the file given by corpus.
- Note:
- Words are not preprocessed in any way; providing an appropriate corpus is your responsibility. Keep in mind that the edit distance used to generate the index is fixed at initialization. Consider purging below-threshold words afterwards.
- Args:
- corpus (str): Path to corpus file.
-
purge_below_threshold_words(self)
Purges words below the threshold.
Consider using this method after creating a dictionary to reduce memory usage. These words are not used in any way during lookup.
-
lookup(self, phrase, verbosity, max_edit_distance)
Attempts to correct the spelling of phrase.
- Note:
- Phrase is not preprocessed in any way.
- Args:
- phrase (str): Word to correct. Should be a valid word.
- verbosity (int; 0, 1 or 2): Output toggle. Set to 0 to output the closest most common correction, 1 to output the closest suggestion, or 2 to output all suggestions.
- max_edit_distance (int): Maximum edit distance to consider.
- Returns:
- list of SuggestionItem: Suggested corrections.
- Raises:
- AssertionError: If max_edit_distance is larger than the maximum edit distance specified at initialization.
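The verbosity levels can be pictured as a filter over ranked candidates. This is a sketch under assumptions: the Suggestion tuple is a hypothetical stand-in for SuggestionItem, and for verbosity 1 it follows the upstream SymSpell convention of returning every suggestion at the smallest distance found, which the parameter description does not spell out.

```python
from collections import namedtuple

# Hypothetical stand-in for SuggestionItem: term, edit distance, corpus count.
Suggestion = namedtuple("Suggestion", ["term", "distance", "count"])


def select(suggestions, verbosity):
    """Filter candidate suggestions according to the verbosity level."""
    if not suggestions:
        return []
    # Closest first; among equally close terms, most frequent first.
    ranked = sorted(suggestions, key=lambda s: (s.distance, -s.count))
    if verbosity == 0:
        return ranked[:1]
    if verbosity == 1:
        best = ranked[0].distance
        return [s for s in ranked if s.distance == best]
    return ranked


found = [
    Suggestion("help", 2, 50),
    Suggestion("hello", 1, 100),
    Suggestion("hallo", 1, 10),
]
for level in (0, 1, 2):
    print(level, [s.term for s in select(found, level)])
```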
-
lookup_compound(self, phrase, max_edit_distance)
Attempts to correct the spelling of phrase.
- Note:
- Phrase is not preprocessed in any way.
- Args:
- phrase (str): Sentence to correct.
- max_edit_distance (int): Maximum edit distance to consider for each word.
- Returns:
- list of SuggestionItem: Length-one list with the suggested correction.
- Raises:
- AssertionError: If max_edit_distance is larger than the maximum edit distance specified at initialization.
-
_delete_in_suggestion_prefix(self, delete, delete_len, suggestion, suggestion_len)
Helper method to check if delete is a prefix of suggestion.
- Args:
- delete (str): String to look for in the prefix.
- delete_len (int): Length of delete.
- suggestion (str): String to take the prefix from.
- suggestion_len (int): Length of suggestion.
- Returns:
- bool: True if delete is a prefix of suggestion, False otherwise.
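The description is terse, so the following is a sketch of one reading rather than the method itself: it assumes, as in upstream SymSpell, that every character of delete must occur in order within the first prefix_length characters of suggestion. The simplified signature (no explicit lengths, prefix_length passed as an argument) is hypothetical.

```python
def delete_in_suggestion_prefix(delete, suggestion, prefix_length=7):
    """Check that the characters of delete occur, in order, in the suggestion prefix."""
    prefix = suggestion[:prefix_length]
    position = 0
    for char in delete:
        # Search only to the right of the previously matched character.
        position = prefix.find(char, position)
        if position == -1:
            return False
        position += 1
    return True


print(delete_in_suggestion_prefix("hlo", "hello"))
print(delete_in_suggestion_prefix("ing", "spelling", prefix_length=5))
```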
-
_edits(self, word, edit_distance, delete_words)
Recursive helper method that generates deletes.
Refer to the article for details.
- Args:
- word (str): Word to generate deletes from.
- edit_distance (int): Maximum edit distance to consider, recursion depth.
- delete_words (set): Generated deletes; pass an empty set on the first call.
- Returns:
- delete_words (set): Generated deletes.
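A sketch of how such a recursion might look, assuming edit_distance tracks the current depth and the limit comes from the instance. Here the limit is passed as a hypothetical max_distance argument so the function stands alone.

```python
def edits(word, edit_distance, delete_words, max_distance=2):
    """Add to delete_words every string reachable from word by further deletions."""
    edit_distance += 1
    if len(word) > 1:
        for i in range(len(word)):
            # Remove the character at position i.
            delete = word[:i] + word[i + 1:]
            if delete not in delete_words:
                delete_words.add(delete)
                # Recurse only while the distance budget is not used up.
                if edit_distance < max_distance:
                    edits(delete, edit_distance, delete_words, max_distance)
    return delete_words


print(sorted(edits("cat", 0, set())))
```

The shared set doubles as a visited check, so a delete reachable by several deletion orders is expanded only once.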
-
_edits_prefix(self, key)
-
_hash(self, s)
-
_parse_words(self, text, filters='!"#$%&()*+, -./:;<=>?@[\]^_`{|}~\t\n', lower=True, split=' ')
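The method is undocumented, so this is only one plausible reading of its signature: lower-case the text, replace every character in filters with the split string, and return the non-empty tokens. The sketch assumes the filter string ends with a tab and a newline.

```python
def parse_words(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=" "):
    """Lower-case text, replace each filter character with split, and tokenize."""
    if lower:
        text = text.lower()
    # Translation table: every filter character becomes the separator.
    table = str.maketrans({char: split for char in filters})
    return [token for token in text.translate(table).split(split) if token]


print(parse_words("Hello, World!\tfoo_bar"))
```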