lookout.style.typos.research.nn_prediction

Module Contents

lookout.style.typos.research.nn_prediction.extract_embeddings_from_fasttext(fasttext:FastText, tokens:Iterable[str])

Convert the embeddings from FastText to a dense matrix.

Parameters:
  • fasttext – trained embeddings.
  • tokens – tokens whose embeddings to extract; they form axis Y of the returned matrix.
Returns:

matrix with extracted embeddings.
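
As a rough illustration of what such a conversion can look like, the sketch below stacks per-token vectors into a dense matrix with one row per token. It is not the module's implementation: `embeddings_to_matrix` and `toy_vectors` are hypothetical names, a plain dict stands in for the trained model's token-to-vector lookup so the snippet runs without gensim, and reading "axis Y" as the row axis is an assumption.

```python
import numpy as np


def embeddings_to_matrix(lookup, tokens):
    """Stack the vector of every token into a (len(tokens), dim) matrix."""
    # Materialize first: the iterable may be a one-shot generator.
    tokens = list(tokens)
    rows = [np.asarray(lookup[token], dtype=np.float32) for token in tokens]
    return np.stack(rows)


# Hypothetical 4-dimensional embeddings standing in for a trained model.
toy_vectors = {
    "get": [0.1, 0.2, 0.3, 0.4],
    "value": [0.5, 0.6, 0.7, 0.8],
}

matrix = embeddings_to_matrix(toy_vectors, ["get", "value", "get"])
print(matrix.shape)  # one row per token, one column per embedding dimension
```

Repeated tokens simply produce repeated rows, so the row order always matches the order of the input tokens.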

lookout.style.typos.research.nn_prediction.get_features(fasttext:FastText, typos:Sequence[str])
lookout.style.typos.research.nn_prediction.get_target(fasttext:FastText, identifiers:Iterable[str])
lookout.style.typos.research.nn_prediction.generator(features:numpy.ndarray, target:numpy.ndarray, batch_size:int)

Yields batches of data for keras.Model.fit_generator().

Parameters:
  • features – Inputs.
  • target – Labels.
  • batch_size – Batch size.
Returns:

Another batch for fit_generator().
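
A minimal sketch of a generator of this kind is shown below, assuming the batches are taken in order, without shuffling, and that iteration wraps around at the end of the data (fit_generator() expects a generator that loops indefinitely and yields (inputs, targets) tuples). The name `batch_generator` is hypothetical, and plain lists are used so the snippet runs without numpy; the same slicing works on numpy arrays.

```python
import itertools


def batch_generator(features, target, batch_size):
    """Yield (features, target) batches forever, wrapping around at the end."""
    num_samples = len(features)
    while True:
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            # The last batch of a pass may be shorter than batch_size.
            yield features[start:end], target[start:end]


features = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]
target = [0, 1, 2, 3, 4]

# Take four batches: three cover the five samples, the fourth starts a new pass.
batches = list(itertools.islice(batch_generator(features, target, 2), 4))
for batch_features, batch_target in batches:
    print(batch_features, batch_target)
```

Because the generator never terminates on its own, the caller decides how many batches make up one epoch.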

lookout.style.typos.research.nn_prediction.create_model(num_neurons:int, input_len:int, output_len:int)

Builds the fully-connected NN.

Parameters:
  • num_neurons – Number of neurons in each hidden layer.
  • input_len – Input size.
  • output_len – Output size.
Returns:

Built model.

lookout.style.typos.research.nn_prediction.train_model(model:keras.models.Sequential, features:numpy.ndarray, target:numpy.ndarray, save_model_file:str=None, batch_size:int=64, lr:float=0.1, decay:float=1e-07, num_epochs:int=100)
lookout.style.typos.research.nn_prediction.DEFAULT_NUM_NEURONS = 256
lookout.style.typos.research.nn_prediction.DEFAULT_BATCH_SIZE = 64
lookout.style.typos.research.nn_prediction.DEFAULT_LR = 0.1
lookout.style.typos.research.nn_prediction.DEFAULT_DECAY = 0.9
lookout.style.typos.research.nn_prediction.DEFAULT_NUM_EPOCHS = 10
lookout.style.typos.research.nn_prediction.create_and_train_nn_prediction(fasttext:FastText, data:pandas.DataFrame, saved_model_file:str, num_neurons:int=DEFAULT_NUM_NEURONS, batch_size:int=DEFAULT_BATCH_SIZE, lr:float=DEFAULT_LR, decay:float=DEFAULT_DECAY, num_epochs:int=DEFAULT_NUM_EPOCHS)

Train NN model for correction embedding prediction.

Parameters:
  • fasttext – gensim.models.FastText model.
  • data – DataFrame containing columns [Columns.CorrectToken, Columns.Token].
  • saved_model_file – Path to file to dump trained NN model.
  • num_neurons – Number of neurons in each hidden layer.
  • batch_size – Batch size for training.
  • lr – Learning rate.
  • decay – Learning rate exponential decay per epoch.
  • num_epochs – Number of passes over the train dataset.
Returns:

Trained Keras model.
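
To make the lr and decay parameters concrete, the sketch below computes a per-epoch learning rate under one common reading of "exponential decay per epoch": the rate is multiplied by decay after every epoch. That formula is an assumption and is not confirmed by this page; `learning_rate_at` is a hypothetical helper, and its defaults mirror DEFAULT_LR and DEFAULT_DECAY.

```python
def learning_rate_at(epoch, lr=0.1, decay=0.9):
    """Learning rate for a 0-based epoch, assuming lr * decay ** epoch."""
    return lr * decay ** epoch


# Rounded only to keep floating-point noise out of the printed values.
schedule = [round(learning_rate_at(epoch), 6) for epoch in range(4)]
print(schedule)
```

Under this assumption a decay of 0.9 shrinks the learning rate by 10% per epoch, so the choice of num_epochs and decay together determines how small the final steps become.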

lookout.style.typos.research.nn_prediction.get_predictions(fasttext:FastText, model:keras.models.Sequential, typos:Iterable[str])

Get predicted correction embeddings for tokens from typos.

Parameters:
  • fasttext – gensim.models.FastText model.
  • model – Trained NN model.
  • typos – Iterable with tokens to check.
Returns:

Array of predicted correction embeddings.

lookout.style.typos.research.nn_prediction.create_and_train_nn_prediction_from_file(fasttext:str, data:str, dump:str=None, num_neurons:int=DEFAULT_NUM_NEURONS, batch_size:int=DEFAULT_BATCH_SIZE, lr:float=DEFAULT_LR, decay:float=DEFAULT_DECAY, num_epochs:int=DEFAULT_NUM_EPOCHS)

Train NN model for correction embedding prediction from files.

Parameters:
  • fasttext – Path to the binary dump of a FastText model.
  • data – Path to a CSV dump of pandas.DataFrame containing columns [Columns.CorrectToken, Columns.Token].
  • dump – Path to the file in which to dump the trained NN model.
  • num_neurons – Number of neurons in each hidden layer.
  • batch_size – Batch size for training.
  • lr – Learning rate.
  • decay – Learning rate exponential decay per epoch.
  • num_epochs – Number of training passes over the dataset.
Returns:

Trained Keras model.