lookout.style.typos.corruption

Module Contents

lookout.style.typos.corruption.letters
lookout.style.typos.corruption.rand_insert(token:str)

Add a random letter inside the token.

lookout.style.typos.corruption.rand_delete(token:str)

Delete a random symbol from the token.

lookout.style.typos.corruption.rand_substitution(token:str)

Replace a random symbol inside the token with a letter.

lookout.style.typos.corruption.rand_swap(token:str)

Swap two random consecutive symbols inside the token.
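The four edit operations above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the `sketch_*` names, the explicit `rng` argument, the lowercase ASCII alphabet, and the handling of very short tokens are all assumptions made for the example.

```python
import random
import string

# Assumption: lowercase ASCII stands in for the module's `letters` alphabet.
LETTERS = string.ascii_lowercase


def sketch_insert(token: str, rng: random.Random) -> str:
    """Insert one random letter at a random position (ends included)."""
    pos = rng.randint(0, len(token))
    return token[:pos] + rng.choice(LETTERS) + token[pos:]


def sketch_delete(token: str, rng: random.Random) -> str:
    """Remove one random symbol; tokens shorter than 2 are returned unchanged."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token))
    return token[:pos] + token[pos + 1:]


def sketch_substitute(token: str, rng: random.Random) -> str:
    """Replace one random symbol with a random letter."""
    if not token:
        return token
    pos = rng.randrange(len(token))
    return token[:pos] + rng.choice(LETTERS) + token[pos + 1:]


def sketch_swap(token: str, rng: random.Random) -> str:
    """Swap two adjacent symbols at a random position."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token) - 1)
    return token[:pos] + token[pos + 1] + token[pos] + token[pos + 2:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same typos
    for edit in (sketch_insert, sketch_delete, sketch_substitute, sketch_swap):
        print(edit.__name__, edit("token", rng))
```

Passing the random generator explicitly keeps the sketch reproducible; the real functions take only `token`, as their signatures show.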

lookout.style.typos.corruption._rand_typo(token_split:Tuple[str, str, bool], add_typo_probability:float)
lookout.style.typos.corruption.corrupt_tokens_in_df(data:pandas.DataFrame, typo_probability:float, add_typo_probability:float, processes_number:Optional[int]=None, log_level:int=logging.DEBUG)

Create artificial typos in tokens (identifiers) in a pandas DataFrame. Each identifier in the dataframe is corrupted with probability typo_probability; once a token has been corrupted, each subsequent typo in the same token happens with probability add_typo_probability. The operation is out-of-place: the input dataframe is not modified.

Parameters:
  • data – Dataframe containing the columns Columns.Token and Columns.Split.
  • typo_probability – Probability that a token is corrupted.
  • add_typo_probability – Probability that one more corruption is applied to an already corrupted token.
  • processes_number – Number of processes for multiprocessing. If not set, the number of CPUs in the system is used.
  • log_level – Level of logging.
Returns:

New dataframe with added columns Columns.CorrectToken and Columns.CorrectSplit, which hold the original tokens and corresponding splits from data. Columns.Token and Columns.Split now contain the partially corrupted tokens and their corresponding splits.
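The interplay of typo_probability and add_typo_probability can be illustrated with plain Python lists. This sketch is an assumption-laden simplification, not the library's code: it ignores the dataframe, the splits, and multiprocessing, and `sketch_corrupt` and `sketch_edit` are hypothetical helpers. It also assumes that further typos are drawn repeatedly until a draw fails, which is one reading of the description above.

```python
import random
import string
from typing import Callable, List, Tuple


def sketch_edit(token: str, rng: random.Random) -> str:
    """Hypothetical single typo: insert one random lowercase letter."""
    pos = rng.randint(0, len(token))
    return token[:pos] + rng.choice(string.ascii_lowercase) + token[pos:]


def sketch_corrupt(
    tokens: List[str],
    typo_probability: float,
    add_typo_probability: float,
    edit: Callable[[str, random.Random], str],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Return (corrupted, correct) token lists; the input list is not modified."""
    if not 0.0 <= add_typo_probability < 1.0:
        # A value of 1 would make the loop below run forever.
        raise ValueError("add_typo_probability must be in [0, 1)")
    corrupted = []
    for token in tokens:
        new_token = token
        # First typo: happens with typo_probability.
        if rng.random() < typo_probability:
            new_token = edit(new_token, rng)
            # Assumption: each further typo happens with add_typo_probability.
            while rng.random() < add_typo_probability:
                new_token = edit(new_token, rng)
        corrupted.append(new_token)
    # The copy of the originals plays the role of Columns.CorrectToken.
    return corrupted, list(tokens)


if __name__ == "__main__":
    tokens = ["get", "value", "index"]
    corrupted, correct = sketch_corrupt(
        tokens, 0.5, 0.1, sketch_edit, random.Random(0)
    )
    for before, after in zip(correct, corrupted):
        print(before, "->", after)
```

Returning the untouched originals next to the corrupted tokens mirrors the out-of-place behaviour: the caller keeps both the clean and the corrupted version of every identifier.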