:mod:`prepare_dataset`
======================

.. py:module:: prepare_dataset

.. autoapi-nested-parse::

   Filter and prepare a dataset for evaluation.

   Run it on a dataset prepared by `typos_preprocessing.ipynb`.


Module Contents
---------------

.. data:: Changes

.. data:: COLUMNS
   :annotation: = ['identifier', 'correct_id', 'filename', 'line', 'commit', 'repository']

.. data:: NEW_COLUMNS

.. data:: COL2IND

.. data:: NEW_COL2IND

.. py:class:: IdentifierFileCommitRanger(*, filename:str, repository:str, identifier:str, commit:str, directory:Optional[str]=None)

   Find the first commit in which the identifier was added to the file.

   .. attribute:: _log

   .. method:: _run_cmd(self, cmd, step, cwd=None, env=None)

   .. method:: _clone(self)

   .. method:: _checkout(self)

   .. method:: _blame(self, filename=None)

   .. staticmethod:: _validate_date(text)

   .. method:: _get_full_hash(self, short_hash)

   .. method:: _get_diff(self)

   .. method:: _to_changes(self, line)

   .. method:: _pipeline(self)

   .. method:: __call__(self)

   .. staticmethod:: _find_deleted_file(text, filename=None)

.. function:: _parallel_comp(args)

.. function:: pipeline(input_csv, output_csv, n_cores=1, cache='/tmp')

   Find the hash of the first commit in which each identifier appears in its file.

   :param input_csv: Path to the input CSV.
   :param output_csv: Path where the result CSV is stored.
   :param n_cores: Number of cores to use.
   :param cache: Cache location. If empty, caching is disabled.

.. function:: parse_args()

.. data:: args
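
The reference above lists ``COLUMNS`` and ``COL2IND`` without showing how they relate. The following is a minimal sketch, assuming that ``COL2IND`` maps each column name to its position in ``COLUMNS`` and that input rows follow the ``COLUMNS`` order. Neither assumption is confirmed by the reference, and the sample row is invented purely for illustration.

```python
import csv
import io

# Column order as documented for COLUMNS.
COLUMNS = ['identifier', 'correct_id', 'filename', 'line', 'commit', 'repository']

# Assumption: COL2IND maps each column name to its index in COLUMNS.
COL2IND = {name: index for index, name in enumerate(COLUMNS)}

# Hypothetical sample row (not taken from any real dataset).
sample_csv = "tets,test,src/main.py,12,abc1234,https://example.com/repo.git\n"

# Parse the row and look fields up by column name instead of by position.
row = next(csv.reader(io.StringIO(sample_csv)))
identifier = row[COL2IND['identifier']]
filename = row[COL2IND['filename']]
line_number = int(row[COL2IND['line']])

print(identifier, filename, line_number)
```

Looking fields up through a name-to-index mapping keeps row handling readable if the column order changes. ``NEW_COLUMNS`` and ``NEW_COL2IND`` presumably play the same role for the output, but their contents are not documented here, so they are left out of the sketch.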