prepare_dataset

Filter and prepare the dataset for evaluation. It should be run on a dataset prepared by typos_preprocessing.ipynb.

Module Contents

prepare_dataset.Changes
prepare_dataset.COLUMNS = ['identifier', 'correct_id', 'filename', 'line', 'commit', 'repository']
prepare_dataset.NEW_COLUMNS
prepare_dataset.COL2IND
prepare_dataset.NEW_COL2IND
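A minimal sketch of how `COL2IND` could be derived from `COLUMNS` (and, analogously, `NEW_COL2IND` from `NEW_COLUMNS`) as a name-to-index mapping; the actual module may build it differently.

```python
# Assumed construction: map each column name to its position in COLUMNS.
COLUMNS = ['identifier', 'correct_id', 'filename', 'line', 'commit', 'repository']
COL2IND = {name: i for i, name in enumerate(COLUMNS)}
# COL2IND['filename'] → 2
```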
class prepare_dataset.IdentifierFileCommitRanger(*, filename:str, repository:str, identifier:str, commit:str, directory:Optional[str]=None)

Find the first commit in which the identifier was added to the file.

_log
_run_cmd(self, cmd, step, cwd=None, env=None)
_clone(self)
_checkout(self)
_blame(self, filename=None)
static _validate_date(text)
_get_full_hash(self, short_hash)
_get_diff(self)
_to_changes(self, line)
_pipeline(self)
__call__(self)
static _find_deleted_file(text, filename=None)
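The `_blame` step above presumably scans `git blame` output for the identifier. A self-contained sketch of that technique, assuming the conventional blame format where each line starts with a commit hash (the helper name and sample lines are illustrative, not from the real module):

```python
import re

def first_commit_for_identifier(blame_lines, identifier):
    """Return the commit hash of the first blame line containing the
    identifier as a whole word, or None if it never appears."""
    pattern = re.compile(r'\b' + re.escape(identifier) + r'\b')
    for line in blame_lines:
        # `git blame` prefixes each source line with the commit hash.
        commit_hash, _, content = line.partition(' ')
        if pattern.search(content):
            return commit_hash
    return None

blame = [
    'a1b2c3d (Alice 2020-01-01) def foo():',
    'e4f5a6b (Bob   2020-02-02)     total = compute(x)',
]
first_commit_for_identifier(blame, 'total')  # → 'e4f5a6b'
```

The real class chains `_clone`, `_checkout`, `_blame`, and `_get_diff` to run this lookup against an actual repository checkout.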
prepare_dataset._parallel_comp(args)
prepare_dataset.pipeline(input_csv, output_csv, n_cores=1, cache='/tmp')

Find the hash of the first commit in which the identifier appears in the file.

Parameters:
  • input_csv – Path to input csv.
  • output_csv – Path to store result csv.
  • n_cores – How many cores to use.
  • cache – Cache location. If empty, caching is disabled.
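A minimal sketch of the read/map/write shape such a pipeline typically has, assuming the per-row work is farmed out to `n_cores` workers (the placeholder `find_first_commit` stands in for the real `IdentifierFileCommitRanger` lookup; a thread pool is used here for illustration, while the real module likely uses processes via `_parallel_comp`):

```python
import csv
from multiprocessing.dummy import Pool  # thread pool, for illustration only

def find_first_commit(row):
    # Placeholder for the real per-identifier git lookup
    # (IdentifierFileCommitRanger in prepare_dataset).
    return row + ['<first-commit-hash>']

def pipeline(input_csv, output_csv, n_cores=1):
    """Sketch: read rows, process them on n_cores workers,
    write the enriched rows to output_csv."""
    with open(input_csv, newline='') as f:
        rows = list(csv.reader(f))
    with Pool(n_cores) as pool:
        results = pool.map(find_first_commit, rows)
    with open(output_csv, 'w', newline='') as f:
        csv.writer(f).writerows(results)
```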
prepare_dataset.parse_args()
prepare_dataset.args
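A plausible reconstruction of `parse_args` with `argparse`; the flag names are assumptions inferred from `pipeline()`'s parameters, not confirmed by the source.

```python
import argparse

def parse_args(argv=None):
    # Flag names below are assumed, mirroring pipeline()'s parameters.
    parser = argparse.ArgumentParser(
        description='Filter and prepare dataset for evaluation.')
    parser.add_argument('--input-csv', required=True, help='Path to input csv.')
    parser.add_argument('--output-csv', required=True, help='Path to store result csv.')
    parser.add_argument('--n-cores', type=int, default=1, help='How many cores to use.')
    parser.add_argument('--cache', default='/tmp',
                        help='Cache location. If empty, caching is disabled.')
    return parser.parse_args(argv)

args = parse_args(['--input-csv', 'in.csv', '--output-csv', 'out.csv'])
# args.n_cores → 1
```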