lookout.style.typos.preparation

Module Contents

class lookout.style.typos.preparation._DownloadProgressBar

Bases: tqdm.tqdm

update_to(self, b:int=1, bsize:int=1, tsize:Optional[int]=None)
lookout.style.typos.preparation._download_url(url:str, output_path:str)
lookout.style.typos.preparation.generate_vocabulary(frequencies_path:str, config:Mapping[str, Any])

Compose vocabulary from a set of tokens with known frequencies.

Filtering of the input tokens depends on their frequencies and the edit distances between them. All found English words and all tokens the algorithm considers word-like are added regardless of their frequencies.

Parameters:
  • frequencies_path – Path to the .csv file with space-separated word-frequency pairs, one pair per line.
  • config – Configuration for the vocabulary creation:
    stable: How many tokens that have no more frequent edit-distance neighbor to take into the vocabulary.
    suspicious: How many tokens whose more frequent edit-distance neighbor is an English word to take into the vocabulary.
    non_suspicious: How many tokens whose more frequent edit-distance neighbor is not an English word to take into the vocabulary.

Returns: Dictionary with the vocabulary tokens as keys and their corresponding frequencies as values.
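The frequencies file is a plain .csv with one space-separated word-frequency pair per line. A minimal standalone sketch of that format and of parsing it into the token-to-frequency mapping the function produces (the file contents below are illustrative, not part of the library):

```python
import io

# Illustrative frequencies file: one space-separated word-frequency pair per line.
frequencies_csv = io.StringIO("get 1500\nvalue 900\ngte 3\n")

# Parse the pairs into a token -> frequency mapping.
frequencies = {}
for line in frequencies_csv:
    token, count = line.split()
    frequencies[token] = int(count)

print(frequencies)  # {'get': 1500, 'value': 900, 'gte': 3}
```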
lookout.style.typos.preparation.prepare_data(config:Optional[Mapping[str, Any]]=None)

Generate all the necessary data from the raw dataset of split identifiers.

Brief algorithm description:
  1. Derive the vocabulary for typo correction: the set of tokens that are considered correctly spelled. All typo corrections will belong to the vocabulary. It is a set of the most frequent tokens (based on the given statistics).
  2. Save the vocabulary and the statistics for a given number of most frequent tokens for future use.
  3. Filter the raw data, leaving only identifiers that consist solely of tokens from the vocabulary. The result is a dataset of tokens that will be considered correct. It will be used to create artificial misspelling cases for training and testing the corrector model.
  4. Save the prepared dataset, if needed.

Parameters:
  • config – Dictionary with parameters for data preparation. Used fields are:

    data_dir: Directory to put all derived data to.
    drive_dataset_id: ID of the Google Drive document where the raw dataset is stored.
    input_path: Path to a .csv dump of the input dataframe. Should contain the column Columns.Split. If None or the file doesn't exist, the dataset will be loaded from Google Drive.
    frequency_column: Name of the column with identifier frequencies. If not specified, every split is considered to have frequency 1.
    vocabulary_size: Number of most frequent tokens to take as the vocabulary.
    frequencies_size: Number of most frequent tokens to save frequency info for. This information will be used by the corrector as features for these tokens when they are checked. If not specified, frequencies for all present tokens will be saved.
    raw_data_filename: Name of the .csv file in data_dir to put the raw dataset in when loading from Drive.
    vocabulary_filename: Name of the .csv file in data_dir to save the vocabulary to.
    frequencies_filename: Name of the .csv file in data_dir to save the frequencies to.
    prepared_filename: Name of the .csv file in data_dir to save the prepared dataset to.

Returns: Dataset prepared for training the typo correction model.
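Step 3 above (keeping only identifiers whose tokens all belong to the vocabulary) can be sketched in standalone form; the column name and data below are illustrative, not the library's Columns schema or implementation:

```python
import pandas as pd

# Illustrative vocabulary and raw split-identifier data.
vocabulary = {"get", "value", "user", "name"}
raw = pd.DataFrame({"split": ["get value", "get usr name", "user name"]})

# Keep only identifiers whose every token is in the vocabulary,
# mirroring the filtering step described above.
mask = raw["split"].apply(lambda s: all(tok in vocabulary for tok in s.split()))
prepared = raw[mask].reset_index(drop=True)

print(prepared["split"].tolist())  # ['get value', 'user name']
```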
lookout.style.typos.preparation.train_fasttext(data:pandas.DataFrame, config:Optional[Mapping[str, Any]]=None)

Train fasttext model on the given dataset of code identifiers.

Parameters:
  • data – Dataframe with columns Columns.Split and Columns.Frequency.
  • config – Parameters for training the model, options:
    size: Number of identifiers to pick from the given data to train fasttext on.
    corrupt: Whether to make random artificial typos in the training data. Identifiers are corrupted with typo_probability.
    typo_probability: Token corruption probability if corrupt is True.
    add_typo_probability: Probability of a second corruption in an already corrupted token. Used if corrupt is True.
    path: Path where to store the trained fasttext model.
    dim: Number of dimensions of the embeddings in the new model.
    bucket: Number of hash buckets to keep in the fasttext model: the fewer there are, the more compact the model gets.
    adjust_frequencies: Whether to divide frequencies by the number of tokens in the identifiers. Needs to be done when the result of the prepare function is used as data, to obtain the true identifier distribution.
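The corrupt/typo_probability/add_typo_probability options describe random corruption of training tokens. A simplified standalone sketch of such corruption, using a single random character deletion (the library's actual corruption logic may use other edit operations):

```python
import random

def corrupt_token(token: str, typo_probability: float,
                  add_typo_probability: float, rng: random.Random) -> str:
    """Delete one random character with typo_probability; an already
    corrupted token gets a second deletion with add_typo_probability."""
    if len(token) < 2 or rng.random() >= typo_probability:
        return token
    pos = rng.randrange(len(token))
    token = token[:pos] + token[pos + 1:]
    if len(token) >= 2 and rng.random() < add_typo_probability:
        pos = rng.randrange(len(token))
        token = token[:pos] + token[pos + 1:]
    return token

rng = random.Random(0)
corrupted = [corrupt_token("identifier", 0.5, 0.1, rng) for _ in range(5)]
```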
lookout.style.typos.preparation.get_datasets(prepared_data:pandas.DataFrame, config:Optional[Mapping[str, Any]]=None)

Create the train and the test datasets of typos.

  1. Take the specified number of lines from the input dataset.
  2. Make artificial typos in picked identifiers and split them into train and test.

  3. Return the results.

Parameters:
  • prepared_data – Dataframe of correct split identifiers. Must contain the columns Columns.Split, Columns.Frequency and Columns.Token.
  • config – Parameters for creating the train and test datasets, options:
    train_size: Train dataset size.
    test_size: Test dataset size.
    typo_probability: Probability of token corruption.
    add_typo_probability: Probability of a second corruption for an already corrupted token.
    train_path: Path to the .csv file where to save the train dataset.
    test_path: Path to the .csv file where to save the test dataset.
    processes_number: Number of processes for multiprocessing.
Returns: Train and test datasets.
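The two-step procedure above (corrupt the picked identifiers, then split them into train and test) can be sketched in standalone form; the frame layout and train_size/test_size values below are illustrative, not the library's exact Columns schema:

```python
import pandas as pd

# Illustrative dataset of correct tokens with precomputed artificial typos.
data = pd.DataFrame({
    "correct_token": ["value", "user", "name", "get", "token", "split"],
    "token": ["vlue", "usr", "nme", "gt", "tokn", "splt"],
})

# Shuffle once, then carve out train and test of the requested sizes,
# mirroring the train_size/test_size options described above.
train_size, test_size = 4, 2
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)
train = shuffled.iloc[:train_size]
test = shuffled.iloc[train_size:train_size + test_size]
```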
lookout.style.typos.preparation.train_and_evaluate(train_data:pandas.DataFrame, test_data:pandas.DataFrame, vocabulary_path:str, frequencies_path:str, fasttext_path:str, generation_config:Optional[Mapping[str, Any]]=None, ranking_config:Optional[Mapping[str, Any]]=None, processes_number:int=DEFAULT_CORRECTOR_CONFIG['processes_number'])

Create and train TyposCorrector model on the given data.

Parameters:
  • train_data – Dataframe which contains columns Columns.Token, Columns.Split and Columns.CorrectToken.
  • test_data – Dataframe which contains columns Columns.Token, Columns.Split and Columns.CorrectToken.
  • vocabulary_path – Path to a file with vocabulary.
  • frequencies_path – Path to a file with tokens’ frequencies.
  • fasttext_path – Path to a FastText model dump.
  • generation_config – Candidates generation configuration.
  • ranking_config – Ranking configuration.
  • processes_number – Number of processes for multiprocessing.
Returns: Trained model.

lookout.style.typos.preparation.train_from_scratch(config:Optional[Mapping[str, Any]]=None)

Train TyposCorrector on raw data.

  1. Prepare the data; for more info check prepare_data().
  2. Construct the train and test datasets; for more info check get_datasets().
  3. Train and evaluate the TyposCorrector model; for more info check train_and_evaluate().
  4. Return the result.

Parameters:
  • config – Parameters for data preparation and corrector training.
Returns: Trained TyposCorrector model.
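A hedged sketch of what a pipeline config might look like, assuming train_from_scratch forwards the fields documented for the individual stages on this page (the keys come from those descriptions; every value is made up for illustration):

```python
# Illustrative config; keys are the fields documented above, values are invented.
config = {
    "data_dir": "typos-data",
    "input_path": "typos-data/identifiers.csv",
    "vocabulary_size": 100000,
    "frequencies_size": None,       # save frequency info for all present tokens
    "train_size": 50000,
    "test_size": 10000,
    "typo_probability": 0.5,
    "add_typo_probability": 0.05,
    "processes_number": 4,
}
```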