pip3 install numpy
pip3 install pandas
pip3 install nltk
import nltk
nltk.download()
# test NLTK - Basics
from nltk import sent_tokenize, word_tokenize, pos_tag
text = "Text and Vision Intelligence is a course that deal with interpreting texts and images computationally. This has become increasingly important in the last decade due to a large amount of texts and images online as well offline."
print(sent_tokenize(text))
print(word_tokenize(text))
print(pos_tag(word_tokenize(text)))
The theory and practice of computationally extracting and comprehending knowledge from natural texts (which are human generated, and hence carry unstructured information).
In other words, extracting information and intelligence from free text.
Data Science (includes text processing)
Natural Language Processing (NLP)
Text Mining
Natural Language Engineering
Text Processing
Linguistics: How words, phrases, and sentences are formed.
Psycholinguistics: How people understand and communicate using a human language.
Computational linguistics: Deals with models and computational aspects of NLP.
Artificial intelligence: Issues related to knowledge representation and reasoning.
NL Engineering: Implementation of large, realistic systems.
Free text contains unstructured information
High degree of ambiguity in naturally occurring texts.
Meaning derived from context
Context can be external to the text being processed.
Structured information is organized, hence easy to comprehend.
E.g. databases, spreadsheets, and XML files.
Human society took more than 300,000 years to create 12 exabytes (an exabyte is a billion gigabytes) of data.
We are expected to double that in the next 3 years!
Needed to take advantage of the vast amount of information encoded in natural languages, online as well as offline.
Needed even to interface with vast amount of organized information in databases.
Needed to be able to communicate with machines using natural language.
Text-based applications:
finding documents on certain topics (document categorisation)
information retrieval; search for keywords or concepts
(free) information extraction; extracting information relevant to a topic
text comprehension
translation from one language to another
summarization
knowledge management
Dialogue-based applications:
human-machine communication
question-answering
tutoring systems
problem solving
Speech processing:
Voice-to-text conversion and vice versa
Phonetic knowledge, morphological knowledge, syntactic knowledge, semantic knowledge, pragmatic knowledge, and discourse knowledge.
There are several possible ways to interpret an utterance in context
We need to find the most likely interpretation
Discourse model provides a computational framework for this search
Investigation of lexical connectivity patterns as the reflection of discourse structure
Specification of a small set of rhetorical relations among discourse segments
Adaptation of the notion of grammar
Examination of intentions and relations among them as the foundation of discourse structure
ACL - Association for Computational Linguistics
AAAI - every year / IJCAI - every second year
MUC - Message Understanding Conf.
DUC - Document Understanding Conf.
SIGIR - Special Interest Group on Information Retrieval
import nltk
#####################################
tokens = nltk.word_tokenize("AUT is in New Zealand")
postags = nltk.pos_tag(tokens)
print(postags)
#####################################
The code will give you a result similar to the one shown below.
[('AUT', 'NNP'), ('is', 'VBZ'), ('in', 'IN'), ('New', 'NNP'), ('Zealand', 'NNP')]
#####################################
Instantiate the following taggers from NLTK (a minimal sketch is shown below).
a. Unigram tagger
b. TnT tagger
c. Perceptron tagger
d. CRF tagger
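A minimal sketch of instantiating, training, and evaluating the four taggers on a slice of the Brown corpus. The train/test split sizes are arbitrary choices for illustration; the CRF tagger additionally requires the python-crfsuite package, and on older NLTK versions use evaluate() instead of accuracy().
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger, PerceptronTagger, CRFTagger
from nltk.tag.tnt import TnT
train_sents = brown.tagged_sents(categories='news')[:3000]
test_sents = brown.tagged_sents(categories='news')[3000:3500]
# a. Unigram tagger: assigns each word its most frequent tag in the training data
unigram = UnigramTagger(train_sents)
print("Unigram:", unigram.accuracy(test_sents))
# b. TnT tagger: a second-order (trigram) statistical tagger
tnt = TnT()
tnt.train(train_sents)
print("TnT:", tnt.accuracy(test_sents))
# c. Perceptron tagger: ships with a pre-trained English model
perceptron = PerceptronTagger()
print(perceptron.tag(nltk.word_tokenize("AUT is in New Zealand")))
# d. CRF tagger: trains a conditional random field and saves the model to disk
crf = CRFTagger()
crf.train(train_sents, 'crf_pos.model')
print("CRF:", crf.accuracy(test_sents))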
import nltk
######################Simple Tagging###############################################################################################
text = nltk.word_tokenize("The city of Auckland is in New Zeland which is in the Pacific")
print(nltk.pos_tag(text))
# Many words can function in different roles, such as run, live and talk.
text = nltk.word_tokenize("The talk was boring")
print(nltk.pos_tag(text))
text = nltk.word_tokenize("You should talk more in class")
print(nltk.pos_tag(text))
############################################################################################################################
# The tags for tokens computed from the context in which they appear.
# The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all other words w' that appear in the same context, i.e. w1 w' w2.
# You can allocate the same tag to w'.
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')
print("-------------------------------------------------------------")
text.similar('bought')
print("-------------------------------------------------------------")
text.similar('over')
print("-------------------------------------------------------------")
text.similar('the')
###########################################################################################################################
#Representing tagged tokens
tagged_token = nltk.tag.str2tuple('fly/NN')
# tagged_token
# ('fly', 'NN')
print(tagged_token[0])
print(tagged_token[1])
#Reading tagged corpora
print(nltk.corpus.nps_chat.tagged_words())
print(nltk.corpus.conll2000.tagged_words())
print(nltk.corpus.treebank.tagged_words())
#Tagged corpora for several other languages are also available
print(nltk.corpus.sinica_treebank.tagged_words())
print(nltk.corpus.indian.tagged_words())
print(nltk.corpus.mac_morpho.tagged_words())
print(nltk.corpus.conll2002.tagged_words())
print(nltk.corpus.cess_cat.tagged_words())
# ##########################################################################################################################f
#
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())
tag_fd.plot(cumulative=False)
#
# ##########################################################################################################################f
#Let's see which parts of speech frequently occur before a noun
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
print(noun_preceders)
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag, _) in fdist.most_common()])
# ##########################################################################################################################
#
#Explore the corpora
#Let's see which words most often follow the word "often". Verbs are the most frequent and nouns never even appear.
from nltk.corpus import brown
brown_learned_text = brown.words(categories='learned')
print(sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often')))
#Probably better to see the POS that follows the word "often"
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
#
# ##########################################################################################################################
#Look at trigram context, get all "verb TO verb" trigrams.
from nltk.corpus import brown
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)
for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
# ##########################################################################################################################
#
#Let's look at the words that are hardest to tag, i.e. the most ambiguous ones.
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag) for (word, tag) in brown_news_tagged)
for word in sorted(data.conditions()):
    if len(data[word]) > 2:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))
#
# ##########################################################################################################################
#
# Indexed lists versus dictionaries
# Indexed list - a lookup table with index numbers, where each entry is a string. E.g. a document can be represented as a list.
# Dictionary - again a table, but this time the lookup is done using a string and you get back a value, which can be a number or another string. E.g. a frequency distribution table.
# Eg of a dictionary
pos = {}
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'
print(pos)
print(pos['ideas'])
print("\nUseful calls to konw for dictionary iterations")
print(list(pos))
print(sorted(pos))
for word in sorted(pos):
    print(word + ":" + pos[word])
print(list(pos.keys()))
print(list(pos.values()))
print(list(pos.items()))
for key, val in sorted(pos.items()):
    print(key + ":", val)
##########################################################################################################################
#Let's use some datasets from NLTK
from collections import defaultdict
counts = defaultdict(int)
from nltk.corpus import brown
for (word, tag) in brown.tagged_words(categories='news', tagset='universal'):
    counts[tag] += 1
print(counts['NOUN'])
print(sorted(counts))
from operator import itemgetter
print(sorted(counts.items(), key=itemgetter(1), reverse=True))
##########################################################################################################################
#A handy trick to extract an element from a tuple
from nltk.corpus import brown
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(tags)
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction
N noun chair, bandwidth, pacing
V verb study, debate, munch
ADJ adjective purple, tall, ridiculous
ADV adverb unfortunately, slowly
P preposition of, by, to, for, at
PRO pronoun I, me, mine, he, his, her
DET determiner the, a, an, that, those
TP - True Positives: Machine-identified positives which are also identified as positives by the human annotator.
FP - False Positives: Machine-identified positives which have been identified as negatives by the human annotator.
FN - False Negatives: Machine-identified negatives which have been identified as positives by the human annotator.
TN - True Negatives: Machine-identified negatives which have been identified as negatives by the human annotator.
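As a quick illustration (the labels below are hypothetical, just for the example), these four counts translate directly into precision, recall, and F1 using scikit-learn:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
y_human   = [1, 1, 0, 1, 0, 0, 1, 0]   # gold labels assigned by the human annotator
y_machine = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the machine
tn, fp, fn, tp = confusion_matrix(y_human, y_machine).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Precision = TP/(TP+FP):", precision_score(y_human, y_machine))
print("Recall    = TP/(TP+FN):", recall_score(y_human, y_machine))
print("F1:", f1_score(y_human, y_machine))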
"""Convert words to vectores that can be used with classifiers"""
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
print(vectorizer.vocabulary_)
#Try another sentence
text2 = ["the quick puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())
"""BOW model is not very effeictive. represents presence or absence of a token in a document.
Lets keep count of tokens in a document
Using TFIDF instead of BOW, TFIDF also takes into account the frequency instead of just the occurance.
calculated as:
Term frequency (normalized) = (Number of Occurrences of a word)/(Total words in the document) : normalizes based on the size of the document.
IDF(word) = Log((Total number of documents)/(Number of documents containing the word)) : reduces the impact words that are common across documents, eg. the.
TF-IDF is the product of the two."""
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[2]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
""" Extracting n grams from text """
import nltk
text = nltk.word_tokenize("The quick brown fox jumped on the dog")
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append((input_list[i], input_list[i+1]))
    return bigram_list
#get individual items from the bigram
bigrams = find_bigrams(text)
print(bigrams)
print(bigrams[0].__getitem__(0))
print(bigrams[0].__getitem__(1))
#Now write a function to generate trigrams.
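# A possible solution sketch for the exercise above, mirroring find_bigrams:
def find_trigrams(input_list):
    trigram_list = []
    for i in range(len(input_list)-2):
        trigram_list.append((input_list[i], input_list[i+1], input_list[i+2]))
    return trigram_list
print(find_trigrams(text))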
"""using the nltk ngrams function"""
from nltk import ngrams
sentence = 'The quick brown fox jumped over the dog.'
n = 6
sixgrams = ngrams(sentence.split(), n)
all_grams = []   # avoid shadowing the imported ngrams function
for grams in sixgrams:
    all_grams.append(grams)
print(all_grams)
##########################################################################################################################
# Distance metrics
from nltk.metrics import *
s1 = "John went to town on a bike"
s2 = "Peter went to town in a bus"
print("Edit Distnance same string: ",edit_distance(s1,s1))
print("Edit Distnance: ",edit_distance(s1,s2))
print("Binary Distnance: ",binary_distance(set(s1),set(s2)))
print("Jaccard Distnance: ",jaccard_distance(set(s1),set(s2)))
print("Masi Distnance: ",masi_distance(set(s1),set(s2)))
##########################################################################################################################
Vector representation does not consider the ordering of words in a document.
"The dog bit the man" and "The man bit the dog" would have the same representation.
This is called the bag of words (BOW) model.
We will see later that there are models that recover the positional information.
However, the BOW model is surprisingly effective in most situations.
TF (term frequency) measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in the document) / (Total number of terms in the document)
IDF (inverse document frequency) measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log((Total number of documents) / (Number of documents containing term t))
The tf‐idf weight of a term is the product of its tf weight and its idf weight.
tf-idf = log(1+ tf) * log(N/df)
tf-idf = tf * log(N/df) - alternative
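A small hand-rolled sketch of the two variants above on a toy corpus (natural log, no smoothing; the documents are arbitrary, and scikit-learn's TfidfVectorizer uses a smoothed, normalised formulation, so its numbers will differ):
import math
docs = [["the", "quick", "brown", "fox"], ["the", "dog"], ["the", "fox"]]
N = len(docs)
def tf(term, doc):
    return doc.count(term) / len(doc)
def idf(term):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(N / df)
term, doc = "fox", docs[0]
print("tf * log(N/df):       ", tf(term, doc) * idf(term))
print("log(1+tf) * log(N/df):", math.log(1 + tf(term, doc)) * idf(term))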
TF-IDF is the default term-weighting metric.
Euclidean distance measures the distance between text documents. Given two documents da and db represented by their term vectors ta and tb respectively, the Euclidean distance between the two documents is defined as
D(ta, tb) = sqrt( sum over all terms t of (w(t, da) - w(t, db))^2 ), where w(t, d) is the weight of term t in document d.
Cosine similarity is defined as the cosine of the angle between the two term vectors:
cos(ta, tb) = (ta . tb) / (|ta| * |tb|)
The Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared.
In its simple set form, Jaccard(A, B) = |A intersection B| / |A union B|, and the Jaccard distance (as used by nltk above) is 1 - Jaccard(A, B).
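A minimal sketch of the vector-space measures above, computed between two TF-IDF document vectors with scikit-learn (the toy documents are arbitrary):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
docs = ["The quick brown fox jumped over the lazy dog.", "The fox"]
X = TfidfVectorizer().fit_transform(docs)
print("Euclidean distance:", euclidean_distances(X[0], X[1])[0][0])
print("Cosine similarity: ", cosine_similarity(X[0], X[1])[0][0])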
Levenshtein distance is also known as the edit distance.
It is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into another.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different,
i.e. the minimum number of substitutions required to change one string into the other,
or (originally) the minimum number of errors that could have transformed one string into the other.
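A short sketch of the Hamming distance between two equal-length strings (the example words are arbitrary):
def hamming_distance(s1, s2):
    assert len(s1) == len(s2), "Hamming distance is only defined for equal-length strings"
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
print(hamming_distance("karolin", "kathrin"))   # the strings differ in 3 positions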
Methods:
Named entity recognition
Relation detection and classification
Event processing
Temporal processing
Author/source detection
Main concept/theme detection and tracking
Specific information tracking
NER using HMM Learner
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
import nltk
def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    #print(chunked)
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    # flush any entity left over at the end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk
txt = "Jacinda Ardern is the Prime Minister of New Zealand but Roenzo isn't."
print (get_continuous_chunks(txt))
for sent in nltk.sent_tokenize(txt):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
Create regular expressions to extract:
Telephone numbers
Capitalized names
E.g. blocks of digits separated by hyphens
RegEx = (\d+\-)+\d+
matches valid phone numbers like 0210-126-1125 and 09-816-225
incorrectly extracts social security numbers like 123-45-6789
fails to identify numbers like 800.865.1125 and (800)865-CARE
Improved RegEx = (\d{3}[-.\ ()]){1,2}[\dA-Z]{4}
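A quick check of the two patterns above with Python's re module; the sample numbers are the ones from the notes, and fullmatch/search are used only to illustrate the behaviour described:
import re
basic = re.compile(r"(\d+\-)+\d+")
improved = re.compile(r"(\d{3}[-.\ ()]){1,2}[\dA-Z]{4}")
samples = ["0210-126-1125", "09-816-225", "123-45-6789", "800.865.1125", "(800)865-CARE"]
for s in samples:
    print(s, "basic:", bool(basic.fullmatch(s)), "improved:", bool(improved.search(s)))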
Regular expressions provide a flexible way to match strings of text, such as particular characters, words, or patterns of characters
Perl RegEx (similar to grep regex and python)
\w (word char) any alpha-numeric
\d (digit char) any digit
\s (space char) any whitespace
. (wildcard) anything (single character)
\b word boundary
^ beginning of string
$ end of string
? for 0 or 1 occurrences, + for 1 or more occurrences
specific range of number of occurrences: {min,max}.
A{1,5} One to five A’s.
A{5,} Five or more A’s
A{5} Exactly five A’s
Create rules to extract locations
Capitalized word + {city, center, river} indicates location
Ex. New York city
Hudson river
Capitalized word + {street, boulevard, avenue} indicates location
Ex. Fifth avenue
Use context patterns
[PERSON] earned [MONEY]
Ex. Frank earned $20
[PERSON] joined [ORGANIZATION]
Ex. Sam joined IBM
[PERSON], [JOBTITLE]
Ex. Mary, the teacher
Still not so simple:
[PERSON|ORGANIZATION|ANIMAL] fly to [LOCATION|PERSON|EVENT]
Ex. Jerry flew to Japan
Sarah flies to the party
Delta flies to Europe
bird flies to trees
bee flies to the wood
first word of a sentence is capitalized
sometimes titles in web pages are all capitalized
nested named entities contain non-capitalized words
e.g. University of Southern California is an Organization
all nouns in German are capitalized
Tweets/Micro-blogs have “loose” capitalization
movie titles
books
singers
restaurants
etc.
Supervised learning: labeled training examples
methods: Hidden Markov Models, k-Nearest Neighbors, Decision Trees, AdaBoost, SVM, NN…
example: NE recognition, POS tagging, Parsing
Unsupervised learning: labels must be automatically discovered
method: clustering
example: NE disambiguation, text classification
Semi-supervised learning: a small percentage of training examples are labeled, the rest are unlabeled
methods: bootstrapping, active learning, co-training, self-training
example: NE recognition, POS tagging, Parsing, …
NEI: Identify named entities using BIO tags
B beginning of an entity
I continues the entity
O word outside the entity
NEC: Classify into a predefined set of categories
Person names
Organizations (companies, governmental organizations, etc.)
Locations (cities, countries, etc.)
Miscellaneous (movie titles, sport events, etc.)
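A small hand-labeled illustration (the tags are hypothetical) of how BIO identification combines with category classification:
tagged = [("Jacinda", "B-PER"), ("Ardern", "I-PER"), ("is", "O"), ("the", "O"),
          ("Prime", "O"), ("Minister", "O"), ("of", "O"),
          ("New", "B-LOC"), ("Zealand", "I-LOC"), (".", "O")]
for token, tag in tagged:
    print(token, tag)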
Each node is either:
a leaf node, which indicates the value of the target attribute (class) of examples,
OR
a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
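As a toy sketch (the two-attribute data and feature names are hypothetical), scikit-learn's DecisionTreeClassifier builds exactly this structure, and export_text prints the decision nodes and leaf classes:
from sklearn.tree import DecisionTreeClassifier, export_text
X = [[1, 0], [1, 1], [0, 0], [0, 1]]        # two binary attributes per example
y = ["entity", "entity", "other", "other"]  # target class
clf = DecisionTreeClassifier().fit(X, y)
print(export_text(clf, feature_names=["capitalized", "follows_title"]))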
# #Collect all nouns and their modifiers
# import spacy
# nlp = spacy.load('en_core_web_sm')
# doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
# for chunk in doc.noun_chunks:
# print(chunk.text, chunk.label_, chunk.root.text)
""" Dep parsing example"""
import spacy
"""
You will need to install the following particular version of spacy.
pip3 install nltk pip install spacy==2.3.5 pip install
You will also need to install en_core_web_sm using the following.
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
"""
nlp = spacy.load('en_core_web_sm')
# doc = nlp('John ate icecream and Peter ate apple')
# doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
doc = nlp('A man with a knife and a boy hit the dazed shopkeeper on the head yesterday.')
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
The nodes stand for the words in an utterance.
The links between the words represent dependency relations between pairs of words.
Relations may be typed (labeled) or not.
Each linguistic word is connected via a directed link.
The parse tree captures the (unidirectional) relationship between words and phrases.
Context-free grammars can be used to model various facts about the syntax of a language.
When paired with parsers, such grammars constitute a critical component in many applications.
Constituency is a key phenomenon easily captured with CFG rules.
Dependency parsing is based on words and their binary relations, and is easier to do than CFG parsing.
A dependency parse carries less information, but is sufficient for most applications.