lookout.style.typos.corruption
Module Contents
- lookout.style.typos.corruption.letters
- lookout.style.typos.corruption.rand_insert(token: str)
  Add a random letter inside the token.
- lookout.style.typos.corruption.rand_delete(token: str)
  Delete a random symbol from the token.
- lookout.style.typos.corruption.rand_substitution(token: str)
  Substitute a random symbol with a letter inside the token.
- lookout.style.typos.corruption.rand_swap(token: str)
  Swap two random adjacent symbols inside the token.
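The four single-edit corruptions described above can be sketched with the standard library alone. This is an illustrative reimplementation, not the module's actual code: the `sketch_*` names, the explicit `rng` argument, the choice of lowercase ASCII as the alphabet, and the handling of very short tokens are all assumptions.

```python
import random
import string

# Assumption: the alphabet is lowercase ASCII; the module's `letters` may differ.
LETTERS = string.ascii_lowercase


def sketch_insert(token: str, rng: random.Random) -> str:
    """Insert one random letter at a random position (either end included)."""
    pos = rng.randint(0, len(token))
    return token[:pos] + rng.choice(LETTERS) + token[pos:]


def sketch_delete(token: str, rng: random.Random) -> str:
    """Delete one random character; tokens shorter than 2 are returned as-is."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token))
    return token[:pos] + token[pos + 1:]


def sketch_substitution(token: str, rng: random.Random) -> str:
    """Replace one random character with a random letter (may pick the same one)."""
    if not token:
        return token
    pos = rng.randrange(len(token))
    return token[:pos] + rng.choice(LETTERS) + token[pos + 1:]


def sketch_swap(token: str, rng: random.Random) -> str:
    """Swap two adjacent characters; tokens shorter than 2 are returned as-is."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token) - 1)
    return token[:pos] + token[pos + 1] + token[pos] + token[pos + 2:]
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; the documented functions take only `token`, so they presumably rely on a shared random state instead.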
- lookout.style.typos.corruption._rand_typo(token_split: Tuple[str, str, bool], add_typo_probability: float)
- lookout.style.typos.corruption.corrupt_tokens_in_df(data: pandas.DataFrame, typo_probability: float, add_typo_probability: float, processes_number: Optional[int] = None, log_level: int = logging.DEBUG)
  Create artificial typos in tokens (identifiers) in a pandas DataFrame. Each identifier in the dataframe is corrupted with probability typo_probability; every further typo in an already corrupted identifier is added with probability add_typo_probability. The operation runs out-of-place: the input dataframe is not modified.
  Parameters:
  - data – Dataframe that contains the columns Columns.Token and Columns.Split.
  - typo_probability – Probability that a token is corrupted.
  - add_typo_probability – Probability that one more corruption is applied to an already corrupted token.
  - processes_number – Number of processes for multiprocessing. If not set, the number of CPUs in the system is used.
  - log_level – Logging level.
  Returns: A new dataframe with the added columns Columns.CorrectToken and Columns.CorrectSplit, which hold the original tokens and their splits from data. Columns.Token and Columns.Split hold the partially corrupted tokens and their corresponding splits.
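The two probabilities can be read as a two-stage scheme: one draw decides whether a token is corrupted at all, and repeated draws decide how many extra typos it receives. The sketch below illustrates that reading on a plain list of tokens. It is not the library's implementation: `sketch_corrupt`, `sketch_substitute`, and the `seed` argument are hypothetical, only substitution is used as the typo, and the DataFrame columns, split handling, and multiprocessing are left out.

```python
import random
import string
from typing import List, Tuple


def sketch_substitute(token: str, rng: random.Random) -> str:
    """Replace one random character with a different lowercase letter."""
    if not token:
        return token
    pos = rng.randrange(len(token))
    candidates = [c for c in string.ascii_lowercase if c != token[pos]]
    return token[:pos] + rng.choice(candidates) + token[pos + 1:]


def sketch_corrupt(
    tokens: List[str],
    typo_probability: float,
    add_typo_probability: float,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (correct_token, possibly_corrupted_token) pairs.

    The input list is not modified, mirroring the out-of-place behaviour.
    """
    if not 0.0 <= add_typo_probability < 1.0:
        # A value of 1.0 would never stop adding typos in this sketch.
        raise ValueError("add_typo_probability must be in [0, 1)")
    rng = random.Random(seed)
    pairs = []
    for token in tokens:
        corrupted = token
        if rng.random() < typo_probability:
            # First typo, then keep adding typos while the extra draw succeeds.
            corrupted = sketch_substitute(corrupted, rng)
            while rng.random() < add_typo_probability:
                corrupted = sketch_substitute(corrupted, rng)
        pairs.append((token, corrupted))
    return pairs
```

Under this reading, a corrupted token receives 1 / (1 - add_typo_probability) typos on average, so a value of 0.5 gives two typos per corrupted token in expectation.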