filter_dataset

Module Contents

filter_dataset.remove_non_typos(dataset:str, filtered_dataset:str)

Remove examples that are not real typos from the dataset.

  1. Remove examples where the token splits of the wrong and the correct identifiers are equal (the identifiers differ only in non-alpha characters or casing).
  2. Remove examples where the wrong and the correct identifiers are equal at the lemma level.
Parameters:
  • dataset – Path to the dataset.
  • filtered_dataset – Path to save the filtered dataset to.
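
The two filtering rules above can be illustrated with a minimal sketch for a single (wrong, correct) pair. This is not the module's actual implementation: the helper names split_tokens, naive_lemma and is_typo_pair are hypothetical, the regex-based token split is an assumption, and naive_lemma is a crude stand-in for whatever lemmatizer the module really uses. Reading and writing the dataset files is left out because the file format is not described here.

```python
import re
from typing import Callable, List


def split_tokens(identifier: str) -> List[str]:
    """Split an identifier into lowercased alphabetic tokens.

    Assumed splitting scheme: break on non-alpha characters and on
    camelCase boundaries, so "getValue" and "get_value" split identically.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [part.lower() for part in parts]


def naive_lemma(token: str) -> str:
    """Hypothetical stand-in for a real lemmatizer: drop a trailing 's'."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def is_typo_pair(
    wrong: str,
    correct: str,
    lemmatize: Callable[[str], str] = naive_lemma,
) -> bool:
    """Return True if the pair should be kept as a real typo example."""
    wrong_tokens = split_tokens(wrong)
    correct_tokens = split_tokens(correct)

    # Rule 1: equal token splits mean the identifiers differ only in
    # non-alpha characters or casing, so the pair is not a typo.
    if wrong_tokens == correct_tokens:
        return False

    # Rule 2: equal lemmas mean the identifiers differ only in word form.
    wrong_lemmas = [lemmatize(token) for token in wrong_tokens]
    correct_lemmas = [lemmatize(token) for token in correct_tokens]
    if wrong_lemmas == correct_lemmas:
        return False

    return True
```

Under these assumptions, is_typo_pair("getValue", "get_value") and is_typo_pair("getValues", "getValue") are both rejected, while is_typo_pair("getVlaue", "getValue") is kept. A real lemmatizer can be passed through the lemmatize argument in place of naive_lemma.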