
Module Contents


Add a random letter inside the token.


Delete a random symbol from the token.


Substitute a random symbol with a letter inside the token.


Swap two random consecutive symbols inside the token.
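The four operations described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names `insert_letter`, `delete_symbol`, `substitute_symbol` and `swap_adjacent` are hypothetical, and the sketch assumes that the random letters are lowercase ASCII and that tokens too short to edit are returned unchanged.

```python
import random
import string


def insert_letter(token: str, rng: random.Random) -> str:
    """Insert a random lowercase letter at a random position."""
    pos = rng.randint(0, len(token))  # randint includes both ends
    return token[:pos] + rng.choice(string.ascii_lowercase) + token[pos:]


def delete_symbol(token: str, rng: random.Random) -> str:
    """Delete one random symbol; keep tokens shorter than 2 symbols as is."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token))
    return token[:pos] + token[pos + 1:]


def substitute_symbol(token: str, rng: random.Random) -> str:
    """Replace one random symbol with a random lowercase letter."""
    if not token:
        return token
    pos = rng.randrange(len(token))
    return token[:pos] + rng.choice(string.ascii_lowercase) + token[pos + 1:]


def swap_adjacent(token: str, rng: random.Random) -> str:
    """Swap two neighbouring symbols at a random position."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token) - 1)
    return token[:pos] + token[pos + 1] + token[pos] + token[pos + 2:]
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; insertion lengthens the token by one symbol, deletion shortens it by one, and substitution and swapping preserve its length.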

lookout.style.typos.corruption._rand_typo(token_split:Tuple[str, str, bool], add_typo_probability:float)
lookout.style.typos.corruption.corrupt_tokens_in_df(data:pandas.DataFrame, typo_probability:float, add_typo_probability:float, processes_number:Optional[int]=None, log_level:int=logging.DEBUG)

Create artificial typos in tokens (identifiers) in a pandas DataFrame. Each identifier in the dataframe is corrupted with probability typo_probability; every further typo in the same token is then added with probability add_typo_probability. The operation runs out-of-place: the input dataframe is not modified.

  • data – DataFrame that contains the columns Columns.Token and Columns.Split.
  • typo_probability – Probability that a token is corrupted.
  • add_typo_probability – Probability that an already corrupted token receives one more corruption.
  • processes_number – Number of processes for multiprocessing. If not set, the number of CPUs in the system is used.
  • log_level – Level of logging.

Returns a new dataframe with the added columns Columns.CorrectToken and Columns.CorrectSplit, which contain the original tokens and their corresponding splits from data. Columns.Token and Columns.Split in the returned dataframe contain the partially corrupted tokens and their corresponding splits.
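The way the two probabilities interact, and what out-of-place means here, can be illustrated with a simplified sketch. It does not reproduce corrupt_tokens_in_df: the names `corrupt_token` and `corrupt_tokens`, the column names "token" and "correct_token", and the `seed` parameter are assumptions made for illustration, a single swap stands in for all typo types, and split handling and multiprocessing are omitted.

```python
import random

import pandas as pd


def swap_adjacent(token: str, rng: random.Random) -> str:
    """Stand-in typo: swap two neighbouring symbols."""
    if len(token) < 2:
        return token
    pos = rng.randrange(len(token) - 1)
    return token[:pos] + token[pos + 1] + token[pos] + token[pos + 2:]


def corrupt_token(token: str, typo_probability: float,
                  add_typo_probability: float, rng: random.Random) -> str:
    """Corrupt a token with typo_probability, then keep adding typos."""
    if rng.random() >= typo_probability:
        return token  # the token stays correct
    token = swap_adjacent(token, rng)
    # Each further typo happens with add_typo_probability.
    while rng.random() < add_typo_probability:
        token = swap_adjacent(token, rng)
    return token


def corrupt_tokens(data: pd.DataFrame, typo_probability: float,
                   add_typo_probability: float, seed: int = 0) -> pd.DataFrame:
    """Return a corrupted copy of data; the input frame is left untouched."""
    rng = random.Random(seed)
    result = data.copy()
    # Hypothetical column names standing in for Columns.CorrectToken / Columns.Token.
    result["correct_token"] = result["token"]
    result["token"] = [
        corrupt_token(t, typo_probability, add_typo_probability, rng)
        for t in result["token"]
    ]
    return result


data = pd.DataFrame({"token": ["identifier", "value", "index"]})
corrupted = corrupt_tokens(data, typo_probability=0.5, add_typo_probability=0.1)
```

Under these assumptions, a corrupted token carries one typo plus a geometrically distributed number of extra typos, so small values of add_typo_probability keep most corrupted tokens close to their originals.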