Fix errors reported by CRAN
- The CRAN URLs in the overview.Rmd, features.Rmd and README.Rmd files
  need to be in canonical form.
- Remove the link to the wordpredictor demo from README.Rmd.
- Add Shiny demo to the wordpredictor package.
- Add instructions on how to access the demo to README.Rmd.
- Remove the bold style from wordpredictor in README.Rmd.
pakjiddat committed Jun 14, 2021
1 parent ddb67e3 commit 4473e3e
Showing 15 changed files with 224 additions and 82 deletions.
30 changes: 18 additions & 12 deletions README.Rmd
@@ -49,7 +49,7 @@ clean_up <- function(ve) {
[![test-coverage](https://github.com/pakjiddat/word-predictor/workflows/test-coverage/badge.svg)](https://github.com/pakjiddat/word-predictor/actions)
<!-- badges: end -->

The goal of the **wordpredictor** package is to provide a flexible and easy to use framework for generating [n-gram models](https://en.wikipedia.org/wiki/N-gram) for word prediction.
The goal of the wordpredictor package is to provide a flexible and easy to use framework for generating [n-gram models](https://en.wikipedia.org/wiki/N-gram) for word prediction.

The package allows generating n-gram models from input text files. It also allows exploring n-grams using plots. Additionally, it provides methods for measuring n-gram model performance using [Perplexity](https://en.wikipedia.org/wiki/Perplexity) and accuracy.

@@ -71,7 +71,7 @@ devtools::install_github("pakjiddat/word-predictor")
```

## Package structure
The **wordpredictor** package is based on **R6 classes**. It is easy to customize and improve. It provides the following classes:
The wordpredictor package is based on **R6 classes**. It is easy to customize and improve. It provides the following classes:

1. **DataAnalyzer**. It allows analyzing n-grams.
2. **DataCleaner**. It allows cleaning text files. It supports several data cleaning options.
@@ -150,7 +150,7 @@ clean_up(ve)

## Analyzing N-grams

The **wordpredictor** package includes a class called **DataAnalyzer**, which can be used to get an idea of the frequency distribution of n-grams in a model. The model generation process described above creates an n-gram file in the model directory.
The wordpredictor package includes a class called **DataAnalyzer**, which can be used to get an idea of the frequency distribution of n-grams in a model. The model generation process described above creates an n-gram file in the model directory.

For each n-gram number less than or equal to the n-gram size of the model, an n-gram file is generated. In the example above, the n-gram size of the model is 4, so 4 n-gram files are generated in the model folder. These files are **n1.RDS, n2.RDS, n3.RDS and n4.RDS**. The **n2.RDS** file contains n-grams of size 2.
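
For example, the generated n-gram files are plain RDS files, so they can be inspected directly from R. The following is a minimal sketch; the `./model` directory path is an assumption, and the exact structure of the stored object may differ between package versions:

```r
# A sketch of inspecting a generated n-gram file with base R.
# Assumes the model was generated in "./model" and that each n*.RDS file
# stores a table of n-grams with their frequencies.
ngrams <- readRDS("./model/n2.RDS")
# Peek at the first few 2-grams
head(ngrams)
# Count the distinct 2-grams in the model
nrow(ngrams)
```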

@@ -284,7 +284,7 @@ tg_opts = list(

## Evaluating model performance

The **wordpredictor** package allows evaluating n-gram model performance. It can measure the performance of a single model as well as compare the performance of multiple models. When evaluating the performance of a model, intrinsic and extrinsic evaluation is performed.
The wordpredictor package allows evaluating n-gram model performance. It can measure the performance of a single model as well as compare the performance of multiple models. When evaluating the performance of a model, intrinsic and extrinsic evaluation is performed.

Intrinsic evaluation measures the Perplexity score for each sentence in a validation text file. It returns the minimum, maximum and mean Perplexity score for the sentences.
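
For reference, the Perplexity of a sentence of $N$ words is the inverse probability of the sentence normalized by its length. This is the textbook definition; how the package handles unseen n-grams before taking logarithms is not shown in this excerpt:

$$
PP(w_1 \dots w_N) = P(w_1 \dots w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)
$$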

@@ -314,32 +314,38 @@ clean_up(ve)

## Demo

A [DEMO](https://pakjiddat.shinyapps.io/word-predictor/) application demonstrates how to make word predictions. It is based on the Shiny package. It allows predicting the next word based on the given set of words. It displays the 10 most likely words along with their respective probabilities.
The wordpredictor package includes a demo called "word-predictor". The demo is a Shiny application that displays the ten most likely words for a given set of words. To access the demo, run the following command from the R shell:

The demo app is based on Shiny platform. It consists of two files. [server.r](https://gist.github.com/pakjiddat/43c61c54b645e5bd0096d6fd75e58127) and [ui.r](https://gist.github.com/pakjiddat/96727c1df77755e5bcf8a7d4ff731dea). The n-gram model file must be present in the same folder as the two files. It can be generated using the ModelGenerator class.
**`demo("word-predictor", package = "wordpredictor", ask = F)`**.

The following is a screenshot of the demo:

```{r demo, out.width="70%", out.height="70%", echo=F}
knitr::include_graphics("man/figures/README-demo.png")
```
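
The prediction behind the demo can also be reproduced from the R console. The sketch below is distilled from the demo source added in this commit; it assumes the package ships with the `def-model.RDS` model file used by the demo:

```r
library(wordpredictor)

# Path to the model file bundled with the package, as used by the demo
mfp <- system.file("extdata", "def-model.RDS", package = "wordpredictor")
# Create a ModelPredictor from the saved model
mp <- ModelPredictor$new(mfp)
# Predict the ten most likely next words for the given text
p <- mp$predict_word("where is", 10)
# p$found indicates whether a prediction was made; p$words and p$probs
# hold the candidate words and their probabilities
if (p$found) {
    print(p$words)
    print(p$probs)
}
```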

## Website

The [wordpredictor website](https://pakjiddat.github.io/word-predictor/) provides details about how the packages works. It includes code samples and details of all the classes and methods.
The [wordpredictor website](https://pakjiddat.github.io/word-predictor/) provides details about how the package works. It includes code samples and details of all the classes and methods.

## Benefits

The **wordpredictor** package provides an easy to use framework for working with n-gram models. It allows n-gram model generation, performance evaluation and word prediction.
The wordpredictor package provides an easy to use framework for working with n-gram models. It allows n-gram model generation, performance evaluation and word prediction.

## Limitations

The n-gram language model requires a lot of memory for storing the n-grams. The **wordpredictor** package has been tested on a machine with a dual-core processor and 4 GB of RAM. It works well for input data files smaller than 40 MB and an n-gram size of 4. For larger data files and n-gram sizes, more memory and CPU power will be needed.
The n-gram language model requires a lot of memory for storing the n-grams. The wordpredictor package has been tested on a machine with a dual-core processor and 4 GB of RAM. It works well for input data files smaller than 40 MB and an n-gram size of 4. For larger data files and n-gram sizes, more memory and CPU power will be needed.

## Future Work

The **wordpredictor** package may be extended by adding support for different smoothing techniques such as [Good-Turing](https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation), [Katz-Back-off](https://en.wikipedia.org/wiki/Katz%27s_back-off_model) and handling of [Out Of Vocabulary Words](https://en.wikipedia.org/wiki/N-gram#Out-of-vocabulary_words).
The wordpredictor package may be extended by adding support for different smoothing techniques such as [Good-Turing](https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation), [Katz-Back-off](https://en.wikipedia.org/wiki/Katz%27s_back-off_model) and handling of [Out Of Vocabulary Words](https://en.wikipedia.org/wiki/N-gram#Out-of-vocabulary_words).

Support may also be added for different types of n-gram models, such as [Skip-Grams](https://en.wikipedia.org/wiki/N-gram#Skip-gram) and [Syntactic n-grams](https://en.wikipedia.org/wiki/N-gram#Syntactic_n-grams).

The **wordpredictor** package is used for predicting words. It may be extended to support other use cases such as spelling correction, biological sequence analysis, data compression and more. This will require further performance optimization.
The wordpredictor package is used for predicting words. It may be extended to support other use cases such as spelling correction, biological sequence analysis, data compression and more. This will require further performance optimization.

The source code is organized using R6 classes. It is easy to extend. Contributions are welcome!

## Acknowledgments

I was motivated to develop the **wordpredictor** package after taking the courses in the [Data Science Specialization](https://www.coursera.org/specializations/jhu-data-science) offered by Johns Hopkins University on Coursera. I would like to thank the course instructors for making the courses interesting and motivating for the students.
I was motivated to develop the wordpredictor package after taking the courses in the [Data Science Specialization](https://www.coursera.org/specializations/jhu-data-science) offered by Johns Hopkins University on Coursera. I would like to thank the course instructors for making the courses interesting and motivating for the students.
60 changes: 29 additions & 31 deletions README.md
@@ -10,8 +10,8 @@
[![test-coverage](https://github.com/pakjiddat/word-predictor/workflows/test-coverage/badge.svg)](https://github.com/pakjiddat/word-predictor/actions)
<!-- badges: end -->

The goal of the **wordpredictor** package is to provide a flexible and
easy to use framework for generating [n-gram
The goal of the wordpredictor package is to provide a flexible and easy
to use framework for generating [n-gram
models](https://en.wikipedia.org/wiki/N-gram) for word prediction.

The package allows generating n-gram models from input text files. It
@@ -39,7 +39,7 @@ devtools::install_github("pakjiddat/word-predictor")

## Package structure

The **wordpredictor** package is based on **R6 classes**. It is easy to
The wordpredictor package is based on **R6 classes**. It is easy to
customize and improve. It provides the following classes:

1. **DataAnalyzer**. It allows analyzing n-grams.
@@ -135,10 +135,10 @@ clean_up(ve)

## Analyzing N-grams

The **wordpredictor** package includes a class called **DataAnalyzer**,
which can be used to get an idea of the frequency distribution of n-grams
in a model. The model generation process described above creates an
n-gram file in the model directory.
The wordpredictor package includes a class called **DataAnalyzer**, which
can be used to get an idea of the frequency distribution of n-grams in a
model. The model generation process described above creates an n-gram
file in the model directory.

For each n-gram number less than or equal to the n-gram size of the
model, an n-gram file is generated. In the example above, the n-gram size
@@ -294,10 +294,10 @@ tg_opts = list(

## Evaluating model performance

The **wordpredictor** package allows evaluating n-gram model
performance. It can measure the performance of a single model as well as
compare the performance of multiple models. When evaluating the
performance of a model, intrinsic and extrinsic evaluation is performed.
The wordpredictor package allows evaluating n-gram model performance. It
can measure the performance of a single model as well as compare the
performance of multiple models. When evaluating the performance of a
model, intrinsic and extrinsic evaluation is performed.

Intrinsic evaluation measures the Perplexity score for each sentence in
a validation text file. It returns the minimum, maximum and mean
@@ -333,42 +333,40 @@ clean_up(ve)

## Demo

A [DEMO](https://pakjiddat.shinyapps.io/word-predictor/) application
demonstrates how to make word predictions. It is based on the Shiny
package. It allows predicting the next word based on the given set of
words. It displays the 10 most likely words along with their respective
probabilities.
The wordpredictor package includes a demo called “word-predictor”. The
demo is a Shiny application that displays the ten most likely words for
a given set of words. To access the demo, run the following command from
the R shell:

The demo app is based on Shiny platform. It consists of two files.
[server.r](https://gist.github.com/pakjiddat/43c61c54b645e5bd0096d6fd75e58127)
and
[ui.r](https://gist.github.com/pakjiddat/96727c1df77755e5bcf8a7d4ff731dea).
The n-gram model file must be present in the same folder as the two
files. It can be generated using the ModelGenerator class.
**`demo("word-predictor", package = "wordpredictor", ask = F)`**.

The following is a screenshot of the demo:

<img src="man/figures/README-demo.png" width="70%" height="70%" />

## Website

The [wordpredictor website](https://pakjiddat.github.io/word-predictor/)
provides details about how the packages works. It includes code samples
provides details about how the package works. It includes code samples
and details of all the classes and methods.

## Benefits

The **wordpredictor** package provides an easy to use framework for
working with n-gram models. It allows n-gram model generation,
performance evaluation and word prediction.
The wordpredictor package provides an easy to use framework for working
with n-gram models. It allows n-gram model generation, performance
evaluation and word prediction.

## Limitations

The n-gram language model requires a lot of memory for storing the
n-grams. The **wordpredictor** package has been tested on a machine with
n-grams. The wordpredictor package has been tested on a machine with
a dual-core processor and 4 GB of RAM. It works well for input data files
smaller than 40 MB and an n-gram size of 4. For larger data files and
n-gram sizes, more memory and CPU power will be needed.

## Future Work

The **wordpredictor** package may be extended by adding support for
The wordpredictor package may be extended by adding support for
different smoothing techniques such as
[Good-Turing](https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation),
[Katz-Back-off](https://en.wikipedia.org/wiki/Katz%27s_back-off_model)
@@ -380,7 +378,7 @@ Support for different types of n-gram models such as
[Syntactic
n-grams](https://en.wikipedia.org/wiki/N-gram#Syntactic_n-grams).

The **wordpredictor** package is used for predicting words. It may be
The wordpredictor package is used for predicting words. It may be
extended to support other use cases such as spelling correction,
biological sequence analysis, data compression and more. This will
require further performance optimization.
@@ -390,8 +388,8 @@ Contributions are welcome \!.

## Acknowledgments

I was motivated to develop the **wordpredictor** package after taking
the courses in the [Data Science
I was motivated to develop the wordpredictor package after taking the
courses in the [Data Science
Specialization](https://www.coursera.org/specializations/jhu-data-science)
offered by Johns Hopkins University on Coursera. I would like to thank
the course instructors for making the courses interesting and motivating
1 change: 1 addition & 0 deletions demo/00Index
@@ -0,0 +1 @@
word-predictor A shiny application that shows the top ten predicted words
120 changes: 120 additions & 0 deletions demo/word-predictor.R
@@ -0,0 +1,120 @@
# This is the demo word-predictor application. You can run the application by
# clicking 'Run App' above.
#
# The application allows users to enter a set of words. For the given words the
# application attempts to predict the top ten most likely words. These words are
# presented in a bar plot along with the respective probabilities.
#
# Find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com/

library(shiny)
library(ggplot2)
library(wordpredictor)

# Define the UI for the word predictor application
ui <- fluidPage(

# Application title
titlePanel("Word Predictor"),
# Horizontal rule
hr(),

# Sidebar with a text input for the n-gram entered by the user
sidebarLayout(
sidebarPanel(
# The input field
textInput("ngram", "Enter a n-gram:", value = "where is")
),

# Show a plot of the possible predicted words
mainPanel(
# The predicted word
uiOutput("next_word"),
# The predicted word probability
uiOutput("word_prob"),
# Horizontal rule
hr(),
# The bar plot of possible next words
plotOutput("next_word_plot")
)
)
)

# Define the server logic that predicts the next word and renders the plot
server <- function(input, output) {

# The model file path
sfp <- system.file("extdata", "def-model.RDS", package = "wordpredictor")
# The ModelPredictor object is created
mp <- ModelPredictor$new(sfp)
# The predicted word information
p <- NULL

# The next word is predicted
output$next_word <- renderUI({
# If the user entered some text
if (trimws(input$ngram) != "") {
# The text entered by the user is trimmed of surrounding whitespace
w <- trimws(input$ngram)
# The next word is predicted
p <- mp$predict_word(w, 10)
# If the next word was not found
if (!p$found) {
# The next word is set to an information message
nw <- span("Not Found", style = "color:red")
# The next word probability is set to an information
# message
nwp <- span("N.A", style = "color:red")
# The plot is set to empty
output$next_word_plot <- renderPlot({})
# The predicted next word
nw <- tags$div("Predicted Word: ", tags$strong(nw))
# The predicted next word probability
nwp <- tags$div("Word Probability: ", tags$strong(nwp))
# The next word probability is updated
output$word_prob <- renderUI(nwp)
}
else {
# The next word
nw <- p$words[[1]]
# The next word probability
nwp <- p$probs[[1]]
# The plot is updated
output$next_word_plot <- renderPlot({
# A data frame containing the data to plot
df <- data.frame("word" = p$words, "prob" = p$probs)
# The data frame is sorted in descending order
df <- (df[order(df$prob, decreasing = T),])
# The words and their probabilities are plotted
g <- ggplot(data = df, aes(x = reorder(word, prob), y = prob)) +
geom_bar(stat = "identity", fill = "red") +
ggtitle("Predicted words and their probabilities") +
ylab("Probability") +
xlab("Word")
print(g)
})
# The predicted next word
nw <- tags$div("Predicted Word: ", tags$strong(nw))
# The predicted next word probability
nwp <- tags$div("Word Probability: ", tags$strong(nwp))
# The next word probability is updated
output$word_prob <- renderUI(nwp)
}
}
else {
# The next word is set to ""
nw <- tags$span()
# The next word probability text is set to ""
output$word_prob <- renderUI(tags$span())
# The plot is set to empty
output$next_word_plot <- renderPlot({})
}
return(nw)
})
}

# Run the application
shinyApp(ui = ui, server = server)