Next Word Prediction using NLP

Building a word predictor using Natural Language Processing in R

Telvis Calhoun
technicalelvis.com

Goals

Word Prediction using N-Grams

Assume that in the training data "data" occurs 198 times, "data entry" occurs 12 times and "data streams" occurs 10 times. The maximum likelihood estimate (MLE) of a next word is the count of the two-word phrase divided by the count of the first word:

The probability of "data entry" is:

\[ P_{mle}(entry|data) = \frac{12}{198} \approx 0.06 = 6\% \]

The probability of "data streams" is:

\[ P_{mle}(streams|data) = \frac{10}{198} \approx 0.05 = 5\% \]

If the user types "data", the model predicts that "entry" is the most likely next word.
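The arithmetic above can be reproduced in a few lines. This is a minimal Python sketch rather than the R code behind the slides; the names `unigram_counts`, `bigram_counts`, `mle` and `predict_next` are hypothetical, and the counts are the ones from the example.

```python
from collections import Counter

# Counts taken from the example above; in practice they come from the training corpus.
unigram_counts = Counter({"data": 198})
bigram_counts = Counter({("data", "entry"): 12, ("data", "streams"): 10})

def mle(prev_word, next_word):
    """P_mle(next_word | prev_word) = count(prev_word next_word) / count(prev_word)."""
    prev_count = unigram_counts[prev_word]
    if prev_count == 0:
        return 0.0  # unseen context: no estimate available
    return bigram_counts[(prev_word, next_word)] / prev_count

def predict_next(prev_word):
    """Return the continuation with the highest MLE, or None if the word is unseen."""
    candidates = {w2: mle(w1, w2) for (w1, w2) in bigram_counts if w1 == prev_word}
    return max(candidates, key=candidates.get) if candidates else None

print(round(mle("data", "entry"), 4))    # 12 / 198
print(round(mle("data", "streams"), 4))  # 10 / 198
print(predict_next("data"))
```

Calling `predict_next` on a word that never appears as a bigram prefix returns `None`, which is the case the backoff step in the Prediction section is meant to handle.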

Modeling

  1. Generate 2-grams, 3-grams and 4-grams.
  2. Keep the most frequent n-grams that together account for 66% of word instances. This reduces the size of the models.
  3. Calculate the maximum likelihood estimate (MLE) for words for each model.
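The three modeling steps could be sketched as follows. This is an illustrative Python version under stated assumptions, not the original R pipeline: the helper names (`ngrams`, `prune_by_coverage`, `build_models`) are hypothetical, the corpus is assumed to be already tokenized into one flat list, and "66% of word instances" is interpreted as keeping the most frequent n-grams until they cover 66% of all n-gram occurrences.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def prune_by_coverage(counts, coverage=0.66):
    """Keep the most frequent n-grams until they cover `coverage` of all instances."""
    total = sum(counts.values())
    kept, covered = Counter(), 0
    for gram, count in counts.most_common():
        if covered >= coverage * total:
            break
        kept[gram] = count
        covered += count
    return kept

def build_models(tokens, coverage=0.66):
    """Return {n: {prefix: {next_word: MLE}}} for n = 2, 3, 4."""
    # Unpruned counts for orders 1..4; order n-1 supplies the MLE denominators.
    raw = {n: Counter(ngrams(tokens, n)) for n in range(1, 5)}
    models = {}
    for n in (2, 3, 4):
        model = {}
        for gram, count in prune_by_coverage(raw[n], coverage).items():
            prefix, word = gram[:-1], gram[-1]
            # MLE: count(prefix + word) / count(prefix)
            model.setdefault(prefix, {})[word] = count / raw[n - 1][prefix]
        models[n] = model
    return models

tokens = "data entry is data entry not data streams".split()
models = build_models(tokens, coverage=1.0)
print(models[2][("data",)])
```

Using the unpruned lower-order counts as denominators keeps each estimate equal to the formula in the n-gram example, even after rare n-grams have been dropped from the model.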

Prediction

  1. Load the n-gram models.
  2. Tokenize and preprocess user input.
  3. Apply stupid backoff: start with the 4-gram model, back off to the 3-gram model, then to the 2-gram model.
  4. Return the three words with the largest MLE.
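The prediction steps might look like the sketch below. It is written in Python for illustration and makes several assumptions not stated in the slides: the models are held in memory as `{n: {prefix: {word: MLE}}}`, preprocessing is just lowercasing plus a simple regular expression, and each back-off step multiplies the score by 0.4 (the discount commonly used with stupid backoff). The names `tokenize`, `predict` and `ALPHA` are hypothetical.

```python
import re

ALPHA = 0.4  # assumed back-off discount; the slides do not state a value

def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes (assumed preprocessing)."""
    return re.findall(r"[a-z']+", text.lower())

def predict(text, models, k=3, max_order=4):
    """Return up to k next-word guesses, trying the highest-order model first."""
    tokens = tokenize(text)
    scores = {}
    penalty = 1.0
    for n in range(max_order, 1, -1):
        if len(tokens) >= n - 1:
            prefix = tuple(tokens[-(n - 1):])
            for word, p in models.get(n, {}).get(prefix, {}).items():
                # Keep the score from the highest-order model that saw this word.
                scores.setdefault(word, penalty * p)
        penalty *= ALPHA  # every step down to a lower order is discounted
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Toy models standing in for the loaded n-gram models.
models = {
    2: {("data",): {"entry": 0.06, "streams": 0.05, "science": 0.04, "set": 0.01}},
    3: {("big", "data"): {"analytics": 0.5}},
}
print(predict("data", models))
print(predict("Big data", models))
```

Collecting candidates from every order, rather than stopping at the first model that matches, lets the function still return three words when the highest-order match offers fewer than three continuations.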

Demo Application