Instruction:
We've created an object in your workspace called new_text containing several sentences.
- Load the qdap package.
- Print new_text to the console.
- Create term_count consisting of the 10 most frequent terms in new_text.
- Plot term_count.
# Load qdap
library(qdap)
# Print new_text to the console
new_text
# Find the 10 most frequent terms: term_count
term_count <- freq_terms(new_text, 10)
# Plot term_count
plot(term_count)
Instruction:
The data has been loaded for you and is available in coffee_data_file.
- Create tweets using read.csv() on the file coffee_data_file, which contains tweets mentioning coffee. Remember to add stringsAsFactors = FALSE!
- Examine the tweets object using str() to determine which column has the text you'll want to analyze.
- Make a new coffee_tweets object using only the text column you identified earlier. To do so, use the $ operator and column name.
# Import text data
tweets <- read.csv('https://assets.datacamp.com/production/course_935/datasets/coffee.csv', stringsAsFactors = FALSE)
# View the structure of tweets
str(tweets)
# Isolate text from tweets: coffee_tweets
coffee_tweets <- tweets$text
Instruction:
- Load the tm package.
- Create a source from the coffee_tweets vector. Call this new object coffee_source.
# Load tm
library(tm)
# Make a vector source from coffee_tweets
coffee_source <- VectorSource(coffee_tweets)
Instruction:
- Call the VCorpus() function on the coffee_source object to create coffee_corpus.
- Verify coffee_corpus is a VCorpus object by printing it to the console.
- Print the 15th tweet in coffee_corpus to the console to verify that it's a PlainTextDocument that contains the content and metadata of the 15th tweet. Use double bracket subsetting.
- Print the content of the 15th tweet in coffee_corpus. Use double brackets to select the proper tweet, followed by single brackets to extract the content of that tweet.
- Print the content() of the 10th tweet within coffee_corpus.
## coffee_source is already in your workspace
# Make a volatile corpus: coffee_corpus
coffee_corpus <- VCorpus(coffee_source)
# Print out coffee_corpus
coffee_corpus
# Print the 15th tweet in coffee_corpus
coffee_corpus[[15]]
# Print the contents of the 15th tweet in coffee_corpus
coffee_corpus[[15]]$content
# Now use content to review plain text of the 10th tweet
content(coffee_corpus[[10]])
Instruction:
In your workspace, there's a simple data frame called example_text with the correct column names and some metadata. There is also vec_corpus, which is a volatile corpus made with VectorSource().
- Create df_source using DataframeSource() with the example_text.
- Create df_corpus by converting df_source to a volatile corpus object with VCorpus().
- Print out df_corpus. Notice how many documents it contains and the number of retained document level metadata points.
- Use meta() on df_corpus to print the document associated metadata.
- Print the vec_corpus object. Compare the number of documents to df_corpus.
- Use meta() on vec_corpus to compare any metadata found between vec_corpus and df_corpus.
# Create a DataframeSource from the example text
df_source <- DataframeSource(example_text)
# Convert df_source to a volatile corpus
df_corpus <- VCorpus(df_source)
# Examine df_corpus
df_corpus
# Examine df_corpus metadata
meta(df_corpus)
# Compare the number of documents in the vector source
vec_corpus
# Compare metadata in the vector corpus
meta(vec_corpus)
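For reference, DataframeSource() expects the first column of the data frame to be named doc_id and the second text; any further columns are retained as document-level metadata. A minimal sketch of a data frame shaped like the preloaded example_text (assumed structure and made-up values, not the course's data):
# Hypothetical stand-in for example_text
example_df <- data.frame(
  doc_id = c("doc_1", "doc_2", "doc_3"),
  text = c("Text mining is fun!", "R is a great language.", "I like coffee."),
  author = c("Ana", "Ben", "Cam"),  # extra column becomes metadata
  stringsAsFactors = FALSE
)
meta(VCorpus(DataframeSource(example_df)))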
Instruction:
Apply each of the following functions to text, simply printing the results to the console:
- tolower()
- removePunctuation()
- removeNumbers()
- stripWhitespace()
# Create the object: text
text <- "<b>She</b> woke up at 6 A.M. It's so early! She was only 10% awake and began drinking coffee in front of her computer."
# Make lowercase
tolower(text)
# Remove punctuation
removePunctuation(text)
# Remove numbers
removeNumbers(text)
# Remove whitespace
stripWhitespace(text)
Instruction:
Apply the following qdap functions to the text object from the previous exercise:
- bracketX()
- replace_number()
- replace_abbreviation()
- replace_contraction()
- replace_symbol()
## text is still loaded in your workspace
# Remove text within brackets
bracketX(text)
# Replace numbers with words
replace_number(text)
# Replace abbreviations
replace_abbreviation(text)
# Replace contractions
replace_contraction(text)
# Replace symbols with words
replace_symbol(text)
Instruction:
- Review standard stop words by calling stopwords("en").
- Remove "en" stop words from text.
- Add "coffee" and "bean" to the standard stop words, assigning to new_stops.
- Remove the customized stop words, new_stops, from text.
## text is preloaded into your workspace
# List standard English stop words
stopwords("en")
# Print text without standard stop words
removeWords(text, stopwords("en"))
# Add "coffee" and "bean" to the list: new_stops
new_stops <- c("coffee", "bean", stopwords("en"))
# Remove stop words from text
removeWords(text, new_stops)
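Note that removeWords() leaves gaps where the removed words used to be, so in practice it is often chained with stripWhitespace(); a small follow-on using the objects above:
# Remove stop words, then tidy the leftover spacing
stripWhitespace(removeWords(text, new_stops))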
Instruction:
- Create a vector called complicate consisting of the words "complicated", "complication", and "complicatedly" in that order.
- Perform word stemming on complicate with stemDocument(), assigning to an object called stem_doc.
- Create comp_dict, a completion dictionary containing the single word "complicate".
- Create complete_text by applying stemCompletion() to stem_doc. Re-complete the words using comp_dict as the reference corpus.
- Print complete_text to the console.
# Create complicate
complicate <- c("complicated", "complication", "complicatedly")
# Perform word stemming: stem_doc
stem_doc <- stemDocument(complicate)
# Create the completion dictionary: comp_dict
comp_dict <- c("complicate")
# Perform stem completion: complete_text
complete_text <- stemCompletion(stem_doc, comp_dict)
# Print complete_text
complete_text
Instruction:
The document text_data and the completion dictionary comp_dict are loaded in your workspace.
- Remove the punctuation from text_data using removePunctuation(), assigning to rm_punc.
- Call strsplit() on rm_punc with the split argument set equal to " ". Nest this inside unlist(), assigning to n_char_vec.
- Use stemDocument() again to perform word stemming on n_char_vec, assigning to stem_doc.
- Create complete_doc by re-completing your stemmed document with stemCompletion() and using comp_dict as your reference corpus.
Are stem_doc and complete_doc what you expected?
# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)
# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = ' '))
# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc, comp_dict)
# Print complete_doc
complete_doc
Instruction 1:
Edit clean_corpus() in the sample code to apply (in order):
- tm's removePunctuation().
- Base R's tolower().
- Append "mug" to the stop words list.
- tm's stripWhitespace().
# Alter the function code to match the instructions
clean_corpus <- function(corpus) {
  # Remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # Transform to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Add more stopwords
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", "mug"))
  # Strip whitespace
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
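tolower() needs the content_transformer() wrapper because it is a plain base R function rather than one of tm's built-in transformations; the wrapper applies it to each document's content while preserving the corpus structure. The same trick works for any character-in/character-out function, for example (illustrative; my_corpus is a hypothetical VCorpus and qdap is assumed loaded):
# Wrap a qdap cleaner so tm_map() can apply it across a corpus
my_corpus <- tm_map(my_corpus, content_transformer(replace_abbreviation))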
Instruction 2:
- Create clean_corp by applying clean_corpus() to the included corpus tweet_corp.
- Print the cleaned 227th tweet in clean_corp using indexing [[227]] and content().
- Print the original tweets$text tweet using [227].
# Alter the function code to match the instructions
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, words = c(stopwords("en"), "coffee", "mug"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
# Apply your customized function to the tweet_corp: clean_corp
clean_corp <- clean_corpus(tweet_corp)
# Print out a cleaned up tweet
clean_corp[[227]][1]
# Print out the same tweet in original form
tweet_corp[[227]][1]
Instruction:
- Create coffee_dtm by applying DocumentTermMatrix() to clean_corp.
- Print coffee_dtm to the console.
- Create coffee_m, a matrix version of coffee_dtm, using as.matrix().
- Print the dimensions of coffee_m to the console using the dim() function. Note the number of rows and columns.
- Print a subset of coffee_m containing documents (rows) 25 through 35 and terms (columns) "star" and "starbucks".
# Create the document-term matrix from the corpus
coffee_dtm <- DocumentTermMatrix(clean_corp)
# Print out coffee_dtm data
coffee_dtm
# Convert coffee_dtm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_dtm)
# Print the dimensions of coffee_m
dim(coffee_m)
# Review a portion of the matrix to get some Starbucks
coffee_m[25:35, c("star", "starbucks")]
Instruction:
- Create coffee_tdm by applying TermDocumentMatrix() to clean_corp.
- Print coffee_tdm to the console.
- Create coffee_m by converting coffee_tdm to a matrix using as.matrix().
- Print the dimensions of coffee_m to the console. Note the number of rows and columns.
- Print a subset of coffee_m containing terms (rows) "star" and "starbucks" and documents (columns) 25 through 35.
# Create a term-document matrix from the corpus
coffee_tdm <- TermDocumentMatrix(clean_corp)
# Print coffee_tdm data
coffee_tdm
# Convert coffee_tdm to a matrix: coffee_m
coffee_m <- as.matrix(coffee_tdm)
# Print the dimensions of the matrix
dim(coffee_m)
# Review a portion of the matrix
coffee_m[c("star", "starbucks"), 25:35]
Instruction:
- Create coffee_m as a matrix using the term-document matrix coffee_tdm from the last chapter.
- Create term_frequency using the rowSums() function on coffee_m.
- Sort term_frequency in descending order and store the result in term_frequency.
- Use square bracket subsetting, [, to print the top 10 terms from term_frequency.
- Make a barplot of the top 10 terms.
## coffee_tdm is still loaded in your workspace
# Convert coffee_tdm to a matrix
coffee_m <- as.matrix(coffee_tdm)
# Calculate the row sums of coffee_m
term_frequency <- rowSums(coffee_m)
# Sort term_frequency in decreasing order
term_frequency <- sort(term_frequency, decreasing = TRUE)
# View the top 10 most common words
term_frequency[1:10]
# Plot a barchart of the 10 most common words
barplot(term_frequency[1:10], col = "tan", las = 2)
Instruction 1:
- Create frequency using the freq_terms() function on tweets$text. Include arguments to accomplish the following: keep the top 10 terms, require at least 3 letters per term, and use "Top200Words" to define stop words.
- Produce a plot() of the frequency object. Compare it to the plot you produced in the previous exercise.
# Create frequency
frequency <- freq_terms(
  tweets$text,
  top = 10,
  at.least = 3,
  stopwords = "Top200Words"
)
# Make a frequency barchart
plot(frequency)
Instruction 2:
- Create frequency using the freq_terms() function on tweets$text. Include the following arguments: keep the top 10 terms, require at least 3 letters per term, and use stopwords("english") to define stop words.
- Produce a plot() of frequency. Compare it to the previous plot. Do certain words change based on the stop words criterion?
# Create frequency
frequency <- freq_terms(tweets$text,
                        top = 10,
                        at.least = 3,
                        stopwords = stopwords("english"))
# Make a frequency barchart
plot(frequency)
Instruction:
- Load the wordcloud package.
- Print the first 10 entries of term_frequency.
- Extract the terms using names() on term_frequency. Call the vector of strings terms_vec.
- Create a wordcloud() using terms_vec as the words and term_frequency as the values. Add the parameters max.words = 50 and colors = "red".
# Load wordcloud package
library(wordcloud)
# Print the first 10 entries in term_frequency
term_frequency[1:10]
# Vector of terms
terms_vec <- names(term_frequency)
# Create a wordcloud from the term_frequency values
wordcloud(terms_vec, term_frequency, max.words = 50, colors = "red")
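wordcloud() also accepts a scale argument giving the size range between the most and least frequent words; an optional variation on the call above:
# Same cloud with a tighter size range
wordcloud(terms_vec, term_frequency, max.words = 50, colors = "red",
          scale = c(3, 0.5))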
Instruction:
- Apply content() to the 24th document in chardonnay_corp.
- Append "chardonnay" to the English stopwords, assigning to stops.
- Review the last 6 entries of stops.
- Create cleaned_chardonnay_corp with tm_map() by passing in the chardonnay_corp, the function removeWords, and finally the stopwords, stops.
- Examine the content of the 24th tweet again to compare results.
# Review a "cleaned" tweet
content(chardonnay_corp[[24]])
# Add to stopwords
stops <- c(stopwords(kind = 'en'), 'chardonnay')
# Review last 6 stopwords
tail(stops)
# Apply to a corpus
cleaned_chardonnay_corp <- tm_map(chardonnay_corp, removeWords, stops)
# Review a "cleaned" tweet again
content(cleaned_chardonnay_corp[[24]])
Instruction:
We've loaded the wordcloud package for you behind the scenes and will do so for all additional exercises requiring it.
- Sort chardonnay_words with decreasing = TRUE. Save as sorted_chardonnay_words.
- Print the first 6 terms of sorted_chardonnay_words and their values.
- Create terms_vec using names() on chardonnay_words.
- Pass terms_vec and chardonnay_words into the wordcloud() function. Review what other words pop out now that "chardonnay" is removed.
# Sort the chardonnay_words in descending order
sorted_chardonnay_words <- sort(chardonnay_words, decreasing = TRUE)
# Print the 6 most frequent chardonnay terms
head(sorted_chardonnay_words)
# Get a terms vector
terms_vec <- names(chardonnay_words)
# Create a wordcloud from the chardonnay_words values
wordcloud(terms_vec, chardonnay_words, max.words = 50, colors = "red")
Instruction:
- Use the colors() function to list all basic colors.
- Create a wordcloud() using the predefined chardonnay_freqs with the colors "grey80", "darkgoldenrod1", and "tomato". Include the top 100 terms using max.words.
# Print the list of colors
colors()
# Print the wordcloud with the specified colors
wordcloud(chardonnay_freqs$term,
          chardonnay_freqs$num,
          max.words = 100,
          colors = c("grey80", "darkgoldenrod1", "tomato"))
Instruction:
- Use cividis() to select 5 colors in an object called color_pal.
- Examine color_pal in your console.
- Create a wordcloud() from the chardonnay_freqs term and num columns. Include the top 100 terms using max.words, and set the colors to your palette, color_pal.
# Select 5 colors (cividis() comes from the viridisLite package)
color_pal <- cividis(5)
# Examine the palette output
color_pal
# Create a wordcloud with the selected palette
wordcloud(chardonnay_freqs$term,
          chardonnay_freqs$num,
          max.words = 100,
          colors = color_pal)
Instruction:
- Create all_coffee by using paste() with collapse = " " on coffee_tweets$text.
- Create all_chardonnay by using paste() with collapse = " " on chardonnay_tweets$text.
- Create all_tweets using c() to combine all_coffee and all_chardonnay. Make all_coffee the first term.
- Convert all_tweets using VectorSource().
- Create all_corpus by using VCorpus() on all_tweets.
# Create all_coffee
all_coffee <- paste(coffee_tweets$text, collapse = " ")
# Create all_chardonnay
all_chardonnay <- paste(chardonnay_tweets$text, collapse = " ")
# Create all_tweets
all_tweets <- c(all_coffee, all_chardonnay)
# Convert to a vector source
all_tweets <- VectorSource(all_tweets)
# Create all_corpus
all_corpus <- VCorpus(all_tweets)
Instruction:
- Create all_clean by applying the predefined clean_corpus() function to all_corpus.
- Create all_tdm, a TermDocumentMatrix, from all_clean.
- Create all_m by converting all_tdm to a matrix object.
- Print a commonality.cloud() from all_m with max.words = 100 and colors = "steelblue1".
# Clean the corpus
all_clean <- clean_corpus(all_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)
# Create all_m
all_m <- as.matrix(all_tdm)
# Print a commonality cloud
commonality.cloud(all_m, max.words = 100, colors = "steelblue1")
Instruction:
all_corpus is preloaded in your workspace.
- Create all_clean by applying the predefined clean_corpus() function to all_corpus.
- Create all_tdm, a TermDocumentMatrix, from all_clean.
- Use colnames() to give each corpus within all_tdm a distinct name. Name the first column "coffee" and the second column "chardonnay".
- Create all_m by converting all_tdm into matrix form.
- Create a comparison.cloud() using all_m, with colors = c("orange", "blue") and max.words = 50.
# Clean the corpus
all_clean <- clean_corpus(all_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_clean)
# Give the columns distinct names
colnames(all_tdm) <- c("coffee", "chardonnay")
# Create all_m
all_m <- as.matrix(all_tdm)
# Create comparison cloud
comparison.cloud(all_m,
                 colors = c("orange", "blue"),
                 max.words = 50)
Instruction 1:
- Convert all_tdm_m to a data frame. Set the rownames to a column named "word".
- Keep only the rows where the word appears in both corpora, i.e. all counts satisfy . > 0.
- Add a column, difference, equal to the count in the chardonnay column minus the count in the coffee column.
- Keep the top 25 rows by difference.
- Arrange the rows in desc()ending order of difference.
.top25_df <- all_tdm_m %>%
  # Convert to data frame
  as_data_frame(rownames = "word") %>%
  # Keep rows where word appears everywhere
  filter_all(all_vars(. > 0)) %>%
  # Get difference in counts
  mutate(difference = chardonnay - coffee) %>%
  # Keep rows with biggest difference
  top_n(25, wt = difference) %>%
  # Arrange by descending difference
  arrange(desc(difference))
Instruction 2:
- Pass the chardonnay column to pyramid.plot() for the left side.
- Pass the coffee column for the right side.
- Label the plot with the word column.
top25_df <- all_tdm_m %>%
  # Convert to data frame
  as_data_frame(rownames = "word") %>%
  # Keep rows where word appears everywhere
  filter_all(all_vars(. > 0)) %>%
  # Get difference in counts
  mutate(difference = chardonnay - coffee) %>%
  # Keep rows with biggest difference
  top_n(25, wt = difference) %>%
  # Arrange by descending difference
  arrange(desc(difference))
# pyramid.plot() comes from the plotrix package
pyramid.plot(
  # Chardonnay counts
  top25_df$chardonnay,
  # Coffee counts
  top25_df$coffee,
  # Words
  labels = top25_df$word,
  top.labels = c("Chardonnay", "Words", "Coffee"),
  main = "Words in Common",
  unit = NULL,
  gap = 8
)
Instruction:
Update the word_associate() plotting code to work with the coffee data.
- Change the text to coffee_tweets$text.
- Set match.string to "barista".
- Change "chardonnay" to "coffee" in the stopwords too.
- Use the title "Barista Coffee Tweet Associations" in the sample code for the plot.
# Word association
word_associate(coffee_tweets$text, match.string = "barista",
               stopwords = c(Top200Words, "coffee", "amp"),
               network.plot = TRUE, cloud.colors = c("gray85", "darkred"))
# Add title
title(main = "Barista Coffee Tweet Associations")
Instruction:
A hierarchical cluster object, hc, has been created for you from the coffee tweets.
Create a dendrogram using plot() on hc.
# Plot a dendrogram
plot(hc)
Instruction:
The data frame rain has been preloaded in your workspace.
- Create dist_rain by using the dist() function on the values in the second column of rain.
- Print the dist_rain matrix to the console.
- Create hc by performing a cluster analysis, using hclust() on dist_rain.
- plot() the hc object with labels = rain$city to add the city names.
# Create dist_rain
dist_rain <- dist(rain[, 2])
# View the distance matrix
dist_rain
# Create hc
hc <- hclust(dist_rain)
# Plot hc
plot(hc, labels = rain$city)
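The rain data frame itself isn't printed in these notes; to reproduce the example locally, a hypothetical stand-in with the assumed two-column layout (a city name, then a numeric rainfall amount) would be:
# Made-up values purely for illustration
rain <- data.frame(city = c("Cleveland", "Portland", "Boston", "New Orleans"),
                   rainfall = c(39.1, 43.5, 43.8, 62.5))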
Instruction:
tweets_tdm has been created using the chardonnay tweets.
- Print the dimensions of tweets_tdm to the console.
- Create tdm1 using removeSparseTerms() with sparse = 0.95 on tweets_tdm.
- Create tdm2 using removeSparseTerms() with sparse = 0.975 on tweets_tdm.
- Print tdm1 to the console to see how many terms are left.
- Print tdm2 to the console to see how many terms are left.
(The sparse argument is the maximum allowed sparsity: a term is dropped if it is missing from more than that proportion of documents, so lower values remove more terms.)
# Print the dimensions of tweets_tdm
dim(tweets_tdm)
# Create tdm1
tdm1 <- removeSparseTerms(tweets_tdm, sparse = 0.95)
# Create tdm2
tdm2 <- removeSparseTerms(tweets_tdm, sparse = 0.975)
# Print tdm1
tdm1
# Print tdm2
tdm2
Instruction:
- Create tweets_tdm2 by applying removeSparseTerms() on tweets_tdm. Use sparse = 0.975.
- Create tdm_m by using as.matrix() on tweets_tdm2 to convert it to matrix form.
- Create tweets_dist containing the distances of tdm_m using the dist() function.
- Create hc using hclust() on tweets_dist.
- Make a dendrogram with plot() and hc.
# Create tweets_tdm2
tweets_tdm2 <- removeSparseTerms(tweets_tdm, sparse = 0.975)
# Create tdm_m
tdm_m <- as.matrix(tweets_tdm2)
# Create tweets_dist
tweets_dist <- dist(tdm_m)
# Create hc
hc <- hclust(tweets_dist)
# Plot the dendrogram
plot(hc)
Instruction:
The dendextend package has been loaded for you, and a hierarchical cluster object, hc, was created from tweets_dist.
- Create hcd as a dendrogram using as.dendrogram() on hc.
- Print the labels of hcd to the console.
- Use branches_attr_by_labels() to color the branches. Pass it three arguments: the hcd object, c("marvin", "gaye"), and the color "red". Assign to hcd_colored.
- plot() the dendrogram hcd_colored with the title "Better Dendrogram", added using the main argument.
- Add rectangles to the plot with rect.dendrogram(). Specify k = 2 clusters and a border color of "grey50".
# Create hcd
hcd <- as.dendrogram(hc)
# Print the labels in hcd
labels(hcd)
# Change the branch color to red for "marvin" and "gaye"
hcd_colored <- branches_attr_by_labels(hcd, c("marvin", "gaye"), color = "red")
# Plot hcd
plot(hcd_colored, main = "Better Dendrogram")
# Add cluster rectangles
rect.dendrogram(hcd_colored, k = 2, border = "grey50")
Instruction:
- Create associations using findAssocs() on tweets_tdm to find terms associated with "venti" which meet a minimum threshold of 0.2.
- Print associations to the console.
- Create associations_df by calling list_vect2df(), passing associations, then setting col2 to "word" and col3 to "score".
- Run the ggplot2 code to make a dot plot of the association values.
# Create associations
associations <- findAssocs(tweets_tdm, "venti", 0.2)
# View the venti associations
associations
# Create associations_df
associations_df <- list_vect2df(associations, col2 = "word", col3 = "score")
# Plot the associations_df values (theme_gdocs() comes from the ggthemes package)
ggplot(associations_df, aes(score, word)) +
  geom_point(size = 3) +
  theme_gdocs()
Instruction:
A corpus has been preprocessed as before using the chardonnay tweets. The resulting object text_corp is available in your workspace.
- Create a tokenizer function like the above which creates 2-word bigrams.
- Make unigram_dtm by calling DocumentTermMatrix() on text_corp without using the tokenizer() function.
- Make bigram_dtm using DocumentTermMatrix() on text_corp with the tokenizer() function you just made.
- Examine unigram_dtm and bigram_dtm. Which has more terms?
# Make tokenizer function
# NGramTokenizer() and Weka_control() come from the RWeka package
tokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
# Create unigram_dtm
unigram_dtm <- DocumentTermMatrix(text_corp)
# Create bigram_dtm
bigram_dtm <- DocumentTermMatrix(
  text_corp,
  control = list(tokenize = tokenizer))
# Print unigram_dtm
unigram_dtm
# Print bigram_dtm
bigram_dtm
Instruction:
The chardonnay tweets have been cleaned and organized into a DTM called bigram_dtm.
- Create bigram_dtm_m by converting bigram_dtm to a matrix.
- Create freq consisting of the word frequencies by applying colSums() on bigram_dtm_m.
- Extract names(freq) and assign the result to bi_words.
- Pass bi_words to str_subset() with the matching pattern "^marvin" to review all bigrams starting with "marvin".
- Plot a simple wordcloud() passing bi_words, freq and max.words = 15 into the function.
# Create bigram_dtm_m
bigram_dtm_m <- as.matrix(bigram_dtm)
# Create freq
freq <- colSums(bigram_dtm_m)
# Create bi_words
bi_words <- names(freq)
# Examine part of bi_words (str_subset() comes from the stringr package)
str_subset(bi_words, pattern = "^marvin")
# Plot a wordcloud
wordcloud(bi_words, freq, max.words = 15)
Instruction 1:
- Create tdm, a term frequency-based TermDocumentMatrix(), using text_corp.
- Create tdm_m by converting tdm to matrix form.
- Subset tdm_m to get rows c("coffee", "espresso", "latte") and columns 161 to 166.
# Create a TDM
tdm <- TermDocumentMatrix(text_corp)
# Convert it to a matrix
tdm_m <- as.matrix(tdm)
# Examine part of the matrix
tdm_m[c("coffee", "espresso", "latte"), 161:166]
Instruction 2:
Adjust the TermDocumentMatrix() call to use TfIdf weighting. Pass control = list(weighting = weightTfIdf) as an argument to the function.
# Edit the controls to use TfIdf weighting
tdm <- TermDocumentMatrix(text_corp,
                          control = list(weighting = weightTfIdf))
# Convert to matrix again
tdm_m <- as.matrix(tdm)
# Examine the same part: how has it changed?
tdm_m[c("coffee", "espresso", "latte"), 161:166]
Instruction:
- Rename the first column of tweets to "doc_id".
- Set the document schema by calling DataframeSource() on the smaller tweets data frame.
- Make a clean volatile corpus by nesting VCorpus() inside the clean_corpus() function.
- Apply content() to the first tweet with double brackets, such as text_corpus[[1]], to see the cleaned plain text.
- Compare that to the results of the meta() function on the first document with single brackets.
Remember, when accessing part of a corpus the double or single brackets make a difference! For this exercise you will use double brackets with content() and single brackets with meta().
# Rename columns
names(tweets)[1] <- "doc_id"
# Set the schema: docs
docs <- DataframeSource(tweets)
# Make a clean volatile corpus: text_corpus
text_corpus <- clean_corpus(VCorpus(docs))
# Examine the first doc content
content(text_corpus[[1]])
# Access the first doc metadata
meta(text_corpus[1])
Instruction:
- Examine amzn with str() to get its dimensions and a preview of the data.
- Create amzn_pros from the positive reviews column amzn$pros.
- Create amzn_cons from the negative reviews column amzn$cons.
- Examine goog with str() to get its dimensions and a preview of the data.
- Create goog_pros from the positive reviews column goog$pros.
- Create goog_cons from the negative reviews column goog$cons.
# Print the structure of amzn
str(amzn)
# Create amzn_pros
amzn_pros <- amzn$pros
# Create amzn_cons
amzn_cons <- amzn$cons
# Print the structure of goog
str(goog)
# Create goog_pros
goog_pros <- goog$pros
# Create goog_cons
goog_cons <- goog$cons
Instruction 1:
- Apply qdap_clean() to amzn_pros, assigning to qdap_cleaned_amzn_pros.
- Create a source (VectorSource()) from qdap_cleaned_amzn_pros, then turn it into a volatile corpus (VCorpus()), assigning to amzn_p_corp.
- Create amzn_pros_corp by applying tm_clean() to amzn_p_corp.
# qdap_clean the text
qdap_cleaned_amzn_pros <- qdap_clean(amzn_pros)
# Source and create the corpus
amzn_p_corp <- VCorpus(VectorSource(qdap_cleaned_amzn_pros))
# tm_clean the corpus
amzn_pros_corp <- tm_clean(amzn_p_corp)
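qdap_clean() and tm_clean() are custom helpers predefined in the course workspace; their exact definitions aren't reproduced here. A plausible sketch under that assumption (the specific cleaning steps and extra stop words are guesses):
# Assumed shape of the predefined helpers
qdap_clean <- function(x) {
  x <- replace_abbreviation(x)
  x <- replace_contraction(x)
  x <- replace_number(x)
  x <- replace_ordinal(x)
  x <- replace_symbol(x)
  tolower(x)
}
tm_clean <- function(corpus) {
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords,
                   c(stopwords("en"), "Google", "Amazon", "company"))
  corpus
}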
Instruction 2:
- Apply qdap_clean() to amzn_cons, assigning to qdap_cleaned_amzn_cons.
- Create a source from qdap_cleaned_amzn_cons, then turn it into a volatile corpus, assigning to amzn_c_corp.
- Create amzn_cons_corp by applying tm_clean() to amzn_c_corp.
# qdap_clean the text
qdap_cleaned_amzn_cons <- qdap_clean(amzn_cons)
# Source and create the corpus
amzn_c_corp <- VCorpus(VectorSource(qdap_cleaned_amzn_cons))
# tm_clean the corpus
amzn_cons_corp <- tm_clean(amzn_c_corp)
Instruction 1:
- Apply qdap_clean() to goog_pros, assigning to qdap_cleaned_goog_pros.
- Create a source (VectorSource()) from qdap_cleaned_goog_pros, then turn it into a volatile corpus (VCorpus()), assigning to goog_p_corp.
- Create goog_pros_corp by applying tm_clean() to goog_p_corp.
# qdap_clean the text
qdap_cleaned_goog_pros <- qdap_clean(goog_pros)
# Source and create the corpus
goog_p_corp <- VCorpus(VectorSource(qdap_cleaned_goog_pros))
# tm_clean the corpus
goog_pros_corp <- tm_clean(goog_p_corp)
Instruction 2:
- Apply qdap_clean() to goog_cons, assigning to qdap_cleaned_goog_cons.
- Create a source from qdap_cleaned_goog_cons, then turn it into a volatile corpus, assigning to goog_c_corp.
- Create goog_cons_corp by applying tm_clean() to goog_c_corp.
# qdap_clean the text
qdap_cleaned_goog_cons <- qdap_clean(goog_cons)
# Source and create the corpus
goog_c_corp <- VCorpus(VectorSource(qdap_cleaned_goog_cons))
# tm_clean the corpus
goog_cons_corp <- tm_clean(goog_c_corp)
Instruction:
- Create amzn_p_tdm as a TermDocumentMatrix from amzn_pros_corp. Make sure to add control = list(tokenize = tokenizer) so that the terms are bigrams.
- Create amzn_p_tdm_m from amzn_p_tdm by using the as.matrix() function.
- Create amzn_p_freq with rowSums() to obtain the term frequencies from amzn_p_tdm_m.
- Plot a wordcloud() using names(amzn_p_freq) as the words, amzn_p_freq as their frequencies, and max.words = 25 and color = "blue" for aesthetics.
# Create amzn_p_tdm
amzn_p_tdm <- TermDocumentMatrix(
  amzn_pros_corp,
  control = list(tokenize = tokenizer)
)
# Create amzn_p_tdm_m
amzn_p_tdm_m <- as.matrix(amzn_p_tdm)
# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_tdm_m)
# Plot a wordcloud using amzn_p_freq values
wordcloud(names(amzn_p_freq),
          amzn_p_freq,
          max.words = 25,
          color = "blue")
Instruction:
- Create amzn_c_tdm by converting amzn_cons_corp into a TermDocumentMatrix and incorporating the bigram function control = list(tokenize = tokenizer).
- Create amzn_c_tdm_m as a matrix version of amzn_c_tdm.
- Create amzn_c_freq by using rowSums() to get term frequencies from amzn_c_tdm_m.
- Create a wordcloud() using names(amzn_c_freq) and the values amzn_c_freq. Use the arguments max.words = 25 and color = "red" as well.
# Create amzn_c_tdm
amzn_c_tdm <- TermDocumentMatrix(
  amzn_cons_corp,
  control = list(tokenize = tokenizer))
# Create amzn_c_tdm_m
amzn_c_tdm_m <- as.matrix(amzn_c_tdm)
# Create amzn_c_freq
amzn_c_freq <- rowSums(amzn_c_tdm_m)
# Plot a wordcloud of negative Amazon bigrams
wordcloud(names(amzn_c_freq), amzn_c_freq,
          max.words = 25, color = "red")
Instruction:
- Create amzn_c_tdm as a TermDocumentMatrix using amzn_cons_corp with control = list(tokenize = tokenizer).
- Print amzn_c_tdm to the console.
- Create amzn_c_tdm2 by applying the removeSparseTerms() function to amzn_c_tdm with the sparse argument equal to .993.
- Create hc, a hierarchical cluster object, by nesting the distance matrix dist(amzn_c_tdm2) inside the hclust() function. Make sure to also pass method = "complete" to the hclust() function.
- Plot hc to view the clustered bigrams and see how the concepts in the Amazon cons section may lead you to a conclusion.
# Create amzn_c_tdm
amzn_c_tdm <- TermDocumentMatrix(
  amzn_cons_corp,
  control = list(tokenize = tokenizer))
# Print amzn_c_tdm to the console
amzn_c_tdm
# Create amzn_c_tdm2 by removing sparse terms
amzn_c_tdm2 <- removeSparseTerms(amzn_c_tdm, sparse = .993)
# Create hc as a cluster of distance values
hc <- hclust(d = dist(amzn_c_tdm2), method = "complete")
# Produce a plot of hc
plot(hc)
Instruction:
The amzn_pros_corp corpus has been cleaned using the custom functions like before.
- Construct a TDM called amzn_p_tdm from amzn_pros_corp with control = list(tokenize = tokenizer).
- Create amzn_p_m by converting amzn_p_tdm to a matrix.
- Create amzn_p_freq by applying rowSums() to amzn_p_m.
- Create term_frequency using sort() on amzn_p_freq along with the argument decreasing = TRUE.
- Examine the first 5 bigrams using term_frequency[1:5].
- You may be surprised to see "fast paced" as a top term because it could be a negative term related to "long hours". Look at the terms most associated with "fast paced": use findAssocs() on amzn_p_tdm to examine "fast paced" with a 0.2 cutoff.
# Create amzn_p_tdm
amzn_p_tdm <- TermDocumentMatrix(
  amzn_pros_corp,
  control = list(tokenize = tokenizer)
)
# Create amzn_p_m
amzn_p_m <- as.matrix(amzn_p_tdm)
# Create amzn_p_freq
amzn_p_freq <- rowSums(amzn_p_m)
# Create term_frequency
term_frequency <- sort(amzn_p_freq, decreasing = TRUE)
# Print the 5 most common terms
term_frequency[1:5]
# Find associations with fast paced
associations <- findAssocs(amzn_p_tdm, "fast paced", 0.2)
associations
Instruction:
The all_goog_corpus object consisting of Google pro and con reviews is loaded in your workspace.
- Create all_goog_corp by cleaning all_goog_corpus with the predefined tm_clean() function.
- Create all_tdm by converting all_goog_corp to a term-document matrix.
- Create all_m by converting all_tdm to a matrix.
- Construct a comparison.cloud() from all_m. Set max.words to 100. The colors argument is specified for you.
# Create all_goog_corp
all_goog_corp <- tm_clean(all_goog_corpus)
# Create all_tdm
all_tdm <- TermDocumentMatrix(all_goog_corp)
# Create all_m
all_m <- as.matrix(all_tdm)
# Build a comparison cloud
comparison.cloud(all_m,
                 colors = c("#F44336", "#2196f3"),
                 max.words = 100)
Instruction:
Create common_words from all_tdm_df using dplyr functions:
- filter() on the AmazonPro column for nonzero values.
- filter() on the GooglePro column for nonzero values.
- mutate() a new column, diff, which is the abs() (absolute) difference between the term frequency columns.
- Create top5_df by applying top_n() to common_words to extract the top 5 values in the diff column. It will print to your console for review.
- Create a pyramid.plot() passing in top5_df$AmazonPro, then top5_df$GooglePro, and finally add labels with top5_df$terms.
# Filter to words in common and create an absolute diff column
common_words <- all_tdm_df %>%
  filter(
    AmazonPro != 0,
    GooglePro != 0
  ) %>%
  mutate(diff = abs(AmazonPro - GooglePro))
# Extract top 5 common bigrams
(top5_df <- top_n(common_words, 5))
# Create the pyramid plot
pyramid.plot(top5_df$AmazonPro, top5_df$GooglePro,
             labels = top5_df$terms, gap = 12,
             top.labels = c("Amzn", "Pro Words", "Goog"),
             main = "Words in Common", unit = NULL)
Instruction:
- Using top_n() on common_words, obtain the top 5 bigrams weighted on the diff column. The result will print to your console.
- Create a pyramid.plot(). Pass in top5_df$AmazonNeg, top5_df$GoogleNeg, and labels = top5_df$terms. For better labeling, set gap to 12.
- Set top.labels to c("Amzn", "Neg Words", "Goog"). The main and unit arguments are set for you.
# Extract top 5 common bigrams
(top5_df <- top_n(common_words, 5, wt = diff))
# Create a pyramid plot
pyramid.plot(
  # Amazon on the left
  top5_df$AmazonNeg,
  # Google on the right
  top5_df$GoogleNeg,
  # Use terms for labels
  labels = top5_df$terms,
  # Set the gap to 12
  gap = 12,
  # Set top.labels to "Amzn", "Neg Words" & "Goog"
  top.labels = c("Amzn", "Neg Words", "Goog"),
  main = "Words in Common",
  unit = NULL
)