pip3 install numpy
pip3 install pandas
pip3 install nltk
import nltk
nltk.download()
# test NLTK - Basics
from nltk import sent_tokenize, word_tokenize, pos_tag
text = "Text and Vision Intelligence is a course that deal with interpreting texts and images computationally. This has become increasingly important in the last decade due to a large amount of texts and images online as well offline."
print(sent_tokenize(text))
print(word_tokenize(text))
print(pos_tag(word_tokenize(text)))
The theory and practice of computationally extracting and comprehending knowledge from natural texts (which are human generated, and hence carry unstructured information).
In other words, extracting information and intelligence from free text.
Data Science (includes text processing)
Natural Language Processing (NLP)
Text Mining
Natural Language Engineering
Text Processing
Linguistics: How words, phrases, and sentences are formed.
Psycholinguistics: How people understand and communicate using a human language.
Computational linguistics: Deals with models and computational aspects of NLP.
Artificial intelligence: Issues related to knowledge representation and reasoning.
NL Engineering: Implementation of large, realistic systems.
Free text contains unstructured information
High degree of ambiguity in naturally occurring texts.
Meaning derived from context
Context can be external to the text being processed.
Structured information is organized, hence easy to comprehend.
E.g. databases, spreadsheets, and XML files.
Human society took more than 300,000 years to create 12 exabytes (an exabyte is a billion gigabytes) of data.
We are expected to double that in the next 3 years!
Needed to take advantage of the vast amount of information encoded in natural languages, online as well as offline.
Needed even to interface with vast amount of organized information in databases.
Needed to be able to communicate with machines using natural language.
Text-based applications:
finding documents on certain topics (document categorisation)
information retrieval; search for keywords or concepts
(free) information extraction; extracting information relevant to a topic
text comprehension
translation from one language to another
summarization
knowledge management
Dialogue-based applications:
human-machine communication
question-answering
tutoring systems
problem solving
Speech processing:
Voice-to-text conversion and vice versa
Phonetic knowledge, morphological knowledge, syntactic knowledge, semantic knowledge, pragmatic knowledge, and discourse knowledge.
There are several possible ways to interpret an utterance in context
We need to find the most likely interpretation
Discourse model provides a computational framework for this search
Investigation of lexical connectivity patterns as the reflection of discourse structure
Specification of a small set of rhetorical relations among discourse segments
Adaptation of the notion of grammar
Examination of intentions and relations among them as the foundation of discourse structure
ACL - Association for Computational Linguistics
AAAI - every year / IJCAI - every second year
MUC - Message Understanding Conf.
DUC - Document Understanding Conf.
SIGIR - Special Interest Group on Information Retrieval
import nltk
#####################################
tokens = nltk.word_tokenize("AUT is in New Zealand")
postags = nltk.pos_tag(tokens)
print(postags)
#####################################
The code will give you a result similar to the one shown below.
[('AUT', 'NNP'), ('is', 'VBZ'), ('in', 'IN'), ('New', 'NNP'), ('Zealand', 'NNP')]
#####################################
Instantiate the following taggers from NLTK (a minimal sketch is shown below).
a. Unigram tagger
b. TnT tagger
c. Perceptron tagger
d. CRF tagger
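A minimal sketch of instantiating, training, and evaluating the four taggers on a slice of the Brown corpus. The train/test split sizes are arbitrary choices for illustration; the CRF tagger additionally requires the python-crfsuite package, and on older NLTK versions use evaluate() instead of accuracy().
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger, PerceptronTagger, CRFTagger
from nltk.tag.tnt import TnT
train_sents = brown.tagged_sents(categories='news')[:3000]
test_sents = brown.tagged_sents(categories='news')[3000:3500]
# a. Unigram tagger: assigns each word its most frequent tag in the training data
unigram = UnigramTagger(train_sents)
print("Unigram:", unigram.accuracy(test_sents))
# b. TnT tagger: a second-order (trigram) statistical tagger
tnt = TnT()
tnt.train(train_sents)
print("TnT:", tnt.accuracy(test_sents))
# c. Perceptron tagger: ships with a pre-trained English model
perceptron = PerceptronTagger()
print(perceptron.tag(nltk.word_tokenize("AUT is in New Zealand")))
# d. CRF tagger: trains a conditional random field and saves the model to disk
crf = CRFTagger()
crf.train(train_sents, 'crf_pos.model')
print("CRF:", crf.accuracy(test_sents))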
import nltk
######################Simple Tagging###############################################################################################
text = nltk.word_tokenize("The city of Auckland is in New Zeland which is in the Pacific")
print(nltk.pos_tag(text))
# Many words can function in different roles, such as run, live and talk.
text = nltk.word_tokenize("The talk was boring")
print(nltk.pos_tag(text))
text = nltk.word_tokenize("You should talk more in class")
print(nltk.pos_tag(text))
############################################################################################################################
# The tags for tokens computed from the context in which they appear.
# The text.similar() method takes a word w, finds all contexts w1 w w2, then finds all other words w' that appear in the same context, i.e. w1 w' w2.
# You can allocate the same tag to w'.
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')
print("-------------------------------------------------------------")
text.similar('bought')
print("-------------------------------------------------------------")
text.similar('over')
print("-------------------------------------------------------------")
text.similar('the')
###########################################################################################################################
#Representing tagged tokens
tagged_token = nltk.tag.str2tuple('fly/NN')
# tagged_token
# ('fly', 'NN')
print(tagged_token[0])
print(tagged_token[1])
#Reading tagged corpora
print(nltk.corpus.nps_chat.tagged_words())
print(nltk.corpus.conll2000.tagged_words())
print(nltk.corpus.treebank.tagged_words())
#Tagged corpora for several other languages are also available
print(nltk.corpus.sinica_treebank.tagged_words())
print(nltk.corpus.indian.tagged_words())
print(nltk.corpus.mac_morpho.tagged_words())
print(nltk.corpus.conll2002.tagged_words())
print(nltk.corpus.cess_cat.tagged_words())
# ##########################################################################################################################f
#
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())
tag_fd.plot(cumulative=False)
#
# ##########################################################################################################################f
#Let's see which parts of speech frequently occur before a noun
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
print(noun_preceders)
fdist = nltk.FreqDist(noun_preceders)
print([tag for (tag, _) in fdist.most_common()])
# ##########################################################################################################################
#
#Explore the corpora
#Let's see which words most often follow the word "often". Verbs are the most frequent and nouns never even appear.
from nltk.corpus import brown
brown_learned_text = brown.words(categories='learned')
print(sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often')))
#Probably better to see the POS that follows the word "often"
brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()
#
# ##########################################################################################################################
#Look at trigram context, get all "verb TO verb" trigrams.
from nltk.corpus import brown
def process(sentence):
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)
for tagged_sent in brown.tagged_sents():
    process(tagged_sent)
# ##########################################################################################################################
#
#Let's look at the words that are hardest to tag, i.e. the most ambiguous ones.
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag) for (word, tag) in brown_news_tagged)
for word in sorted(data.conditions()):
    if len(data[word]) > 2:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))
#
# ##########################################################################################################################
#
# Indexed lists versus dictionaries
# Indexed list - a lookup table with index numbers, where each entry is a string. E.g. a document can be represented as a list.
# Dictionary - again a table, but this time the lookup is done using a string and you get back a value, which can be a number or another string. E.g. a frequency distribution table.
# Eg of a dictionary
pos = {}
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'
print(pos)
print(pos['ideas'])
print("\nUseful calls to konw for dictionary iterations")
print(list(pos))
print(sorted(pos))
for word in sorted(pos):
    print(word + ":" + pos[word])
print(list(pos.keys()))
print(list(pos.values()))
print(list(pos.items()))
for key, val in sorted(pos.items()):
    print(key + ":", val)
##########################################################################################################################
#Let's use some datasets from NLTK
from collections import defaultdict
counts = defaultdict(int)
from nltk.corpus import brown
for (word, tag) in brown.tagged_words(categories='news', tagset='universal'):
    counts[tag] += 1
print(counts['NOUN'])
print(sorted(counts))
from operator import itemgetter
print(sorted(counts.items(), key=itemgetter(1), reverse=True))
##########################################################################################################################
#A handy trick to extract an element from a tuple
from nltk.corpus import brown
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(tags)
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction
N noun chair, bandwidth, pacing
V verb study, debate, munch
ADJ adjective purple, tall, ridiculous
ADV adverb unfortunately, slowly
P preposition of, by, to, for, at
PRO pronoun I, me, mine, he, his, her
DET determiner the, a, an, that, those
TP - True Positives: Machine-identified positives which are also identified as positives by the human annotator.
FP - False Positives: Machine-identified positives which have been identified as negatives by the human annotator.
FN - False Negatives: Machine-identified negatives which have been identified as positives by the human annotator.
TN - True Negatives: Machine-identified negatives which have been identified as negatives by the human annotator.
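As a quick illustration (the labels below are hypothetical, just for the example), these four counts translate directly into precision, recall, and F1 using scikit-learn:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
y_human   = [1, 1, 0, 1, 0, 0, 1, 0]   # gold labels assigned by the human annotator
y_machine = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the machine
tn, fp, fn, tp = confusion_matrix(y_human, y_machine).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Precision = TP/(TP+FP):", precision_score(y_human, y_machine))
print("Recall    = TP/(TP+FN):", recall_score(y_human, y_machine))
print("F1:", f1_score(y_human, y_machine))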
"""Convert words to vectores that can be used with classifiers"""
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
print(vectorizer.vocabulary_)
#Try another sentence
text2 = ["the quick puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())
"""BOW model is not very effeictive. represents presence or absence of a token in a document.
Lets keep count of tokens in a document
Using TFIDF instead of BOW, TFIDF also takes into account the frequency instead of just the occurance.
calculated as:
Term frequency (normalized) = (Number of Occurrences of a word)/(Total words in the document) : normalizes based on the size of the document.
IDF(word) = Log((Total number of documents)/(Number of documents containing the word)) : reduces the impact words that are common across documents, eg. the.
TF-IDF is the product of the two."""
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[2]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
""" Extracting n grams from text """
import nltk
text = nltk.word_tokenize("The quick brown fox jumped on the dog")
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append((input_list[i], input_list[i+1]))
    return bigram_list
#get individual items from the bigram
bigrams = find_bigrams(text)
print(bigrams)
print(bigrams[0].__getitem__(0))
print(bigrams[0].__getitem__(1))
#Now write a function to generate trigrams.
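# A possible solution sketch for the exercise above, mirroring find_bigrams:
def find_trigrams(input_list):
    trigram_list = []
    for i in range(len(input_list)-2):
        trigram_list.append((input_list[i], input_list[i+1], input_list[i+2]))
    return trigram_list
print(find_trigrams(text))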
"""using the nltk ngrams function"""
from nltk import ngrams
sentence = 'The quick brown fox jumped over the dog.'
n = 6
sixgrams = ngrams(sentence.split(), n)
all_grams = []   # avoid shadowing the imported ngrams function
for grams in sixgrams:
    all_grams.append(grams)
print(all_grams)
##########################################################################################################################
# Distance metrics
from nltk.metrics import *
s1 = "John went to town on a bike"
s2 = "Peter went to town in a bus"
print("Edit Distnance same string: ",edit_distance(s1,s1))
print("Edit Distnance: ",edit_distance(s1,s2))
print("Binary Distnance: ",binary_distance(set(s1),set(s2)))
print("Jaccard Distnance: ",jaccard_distance(set(s1),set(s2)))
print("Masi Distnance: ",masi_distance(set(s1),set(s2)))
##########################################################################################################################
Vector representation does not consider the ordering of words in a document.
"The dog bit the man" and "The man bit the dog" would have the same representation.
This is called the bag of words (BOW) model.
We will see later that there are models that recover the positional information.
However, the BOW model is surprisingly effective in most situations.
TF (term frequency) measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in short ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:
TF(t) = (Number of times term t appears in the document) / (Total number of terms in the document)
IDF (inverse document frequency) measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:
IDF(t) = log((Total number of documents) / (Number of documents containing term t))
The tf‐idf weight of a term is the product of its tf weight and its idf weight.
tf-idf = log(1+ tf) * log(N/df)
tf-idf = tf * log(N/df) - alternative
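A small hand-rolled sketch of the two variants above on a toy corpus (natural log, no smoothing; the documents are arbitrary, and scikit-learn's TfidfVectorizer uses a smoothed, normalised formulation, so its numbers will differ):
import math
docs = [["the", "quick", "brown", "fox"], ["the", "dog"], ["the", "fox"]]
N = len(docs)
def tf(term, doc):
    return doc.count(term) / len(doc)
def idf(term):
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    return math.log(N / df)
term, doc = "fox", docs[0]
print("tf * log(N/df):       ", tf(term, doc) * idf(term))
print("log(1+tf) * log(N/df):", math.log(1 + tf(term, doc)) * idf(term))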
TF-IDF is the default term-weighting metric.
Euclidean distance measures the distance between text documents. Given two documents da and db represented by their term vectors ta and tb respectively, the Euclidean distance between the two documents is defined as
D(ta, tb) = sqrt( sum over all terms t of (w(t, da) - w(t, db))^2 ), where w(t, d) is the weight of term t in document d.
Cosine similarity is defined as the cosine of the angle between the two term vectors:
cos(ta, tb) = (ta . tb) / (|ta| * |tb|)
The Jaccard coefficient compares the sum weight of shared terms to the sum weight of terms that are present in either of the two documents but are not shared.
In its simple set form, Jaccard(A, B) = |A intersection B| / |A union B|, and the Jaccard distance (as used by nltk above) is 1 - Jaccard(A, B).
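A minimal sketch of the vector-space measures above, computed between two TF-IDF document vectors with scikit-learn (the toy documents are arbitrary):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
docs = ["The quick brown fox jumped over the lazy dog.", "The fox"]
X = TfidfVectorizer().fit_transform(docs)
print("Euclidean distance:", euclidean_distances(X[0], X[1])[0][0])
print("Cosine similarity: ", cosine_similarity(X[0], X[1])[0][0])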
Levenshtein distance is also known as the edit distance.
It is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into another.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different,
i.e. the minimum number of substitutions required to change one string into the other,
or (originally) the minimum number of errors that could have transformed one string into the other.
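A short sketch of the Hamming distance between two equal-length strings (the example words are arbitrary):
def hamming_distance(s1, s2):
    assert len(s1) == len(s2), "Hamming distance is only defined for equal-length strings"
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))
print(hamming_distance("karolin", "kathrin"))   # the strings differ in 3 positions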
Methods:
Named entity recognition
Relation detection and classification
Event processing
Temporal processing
Author/source detection
Main concept/theme detection and tracking
Specific information tracking
NER using HMM Learner
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
import nltk
def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []
    #print(chunked)
    for i in chunked:
        if type(i) == Tree:
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    # flush any entity left over at the end of the text
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk
txt = "Jacinda Ardern is the Prime Minister of New Zealand but Roenzo isn't."
print (get_continuous_chunks(txt))
for sent in nltk.sent_tokenize(txt):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
Create regular expressions to extract:
Telephone numbers
Capitalized names
E.g. blocks of digits separated by hyphens
RegEx = (\d+\-)+\d+
matches valid phone numbers like 0210-126-1125 and 09-816-225
incorrectly extracts social security numbers like 123-45-6789
fails to identify numbers like 800.865.1125 and (800)865-CARE
Improved RegEx = (\d{3}[-.\ ()]){1,2}[\dA-Z]{4}
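A quick check of the two patterns above with Python's re module; the sample numbers are the ones from the notes, and fullmatch/search are used only to illustrate the behaviour described:
import re
basic = re.compile(r"(\d+\-)+\d+")
improved = re.compile(r"(\d{3}[-.\ ()]){1,2}[\dA-Z]{4}")
samples = ["0210-126-1125", "09-816-225", "123-45-6789", "800.865.1125", "(800)865-CARE"]
for s in samples:
    print(s, "basic:", bool(basic.fullmatch(s)), "improved:", bool(improved.search(s)))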
Regular expressions provide a flexible way to match strings of text, such as particular characters, words, or patterns of characters
Perl RegEx (similar to grep regex and python)
\w (word char) any alpha-numeric
\d (digit char) any digit
\s (space char) any whitespace
. (wildcard) anything (single character)
\b word boundary
^ beginning of string
$ end of string
? for 0 or 1 occurrences, + for 1 or more occurrences
specific range of number of occurrences: {min,max}.
A{1,5} One to five A’s.
A{5,} Five or more A’s
A{5} Exactly five A’s
Create rules to extract locations
Capitalized word + {city, center, river} indicates location
Ex. New York city
Hudson river
Capitalized word + {street, boulevard, avenue} indicates location
Ex. Fifth avenue
Use context patterns
[PERSON] earned [MONEY]
Ex. Frank earned $20
[PERSON] joined [ORGANIZATION]
Ex. Sam joined IBM
[PERSON], [JOBTITLE]
Ex. Mary, the teacher
Still not so simple:
[PERSON|ORGANIZATION|ANIMAL] fly to [LOCATION|PERSON|EVENT]
Ex. Jerry flew to Japan
Sarah flies to the party
Delta flies to Europe
bird flies to trees
bee flies to the wood
first word of a sentence is capitalized
sometimes titles in web pages are all capitalized
nested named entities contain non-capitalized words
e.g. University of Southern California is an Organization
all nouns in German are capitalized
Tweets/Micro-blogs have “loose” capitalization
movie titles
books
singers
restaurants
etc.
Supervised learning: labeled training examples
methods: Hidden Markov Models, k-Nearest Neighbors, Decision Trees, AdaBoost, SVM, NN…
example: NE recognition, POS tagging, Parsing
Unsupervised learning: labels must be automatically discovered
method: clustering
example: NE disambiguation, text classification
Semi-supervised learning: a small percentage of training examples are labeled, the rest are unlabeled
methods: bootstrapping, active learning, co-training, self-training
example: NE recognition, POS tagging, Parsing, …
NEI: Identify named entities using BIO tags
B beginning of an entity
I continues the entity
O word outside the entity
NEC: Classify into a predefined set of categories
Person names
Organizations (companies, governmental organizations, etc.)
Locations (cities, countries, etc.)
Miscellaneous (movie titles, sport events, etc.)
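A small hand-labeled illustration (the tags are hypothetical) of how BIO identification combines with category classification:
tagged = [("Jacinda", "B-PER"), ("Ardern", "I-PER"), ("is", "O"), ("the", "O"),
          ("Prime", "O"), ("Minister", "O"), ("of", "O"),
          ("New", "B-LOC"), ("Zealand", "I-LOC"), (".", "O")]
for token, tag in tagged:
    print(token, tag)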
Each node is either:
a leaf node, which indicates the value of the target attribute (class) of examples,
OR
a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
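As a toy sketch (the two-attribute data and feature names are hypothetical), scikit-learn's DecisionTreeClassifier builds exactly this structure, and export_text prints the decision nodes and leaf classes:
from sklearn.tree import DecisionTreeClassifier, export_text
X = [[1, 0], [1, 1], [0, 0], [0, 1]]        # two binary attributes per example
y = ["entity", "entity", "other", "other"]  # target class
clf = DecisionTreeClassifier().fit(X, y)
print(export_text(clf, feature_names=["capitalized", "follows_title"]))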
# #Collect all nouns and their modifiers
# import spacy
# nlp = spacy.load('en_core_web_sm')
# doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
# for chunk in doc.noun_chunks:
# print(chunk.text, chunk.label_, chunk.root.text)
""" Dep parsing example"""
import spacy
"""
You will need to install the following particular version of spacy.
pip3 install nltk pip install spacy==2.3.5 pip install
You will also need to install en_core_web_sm using the following.
pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
"""
nlp = spacy.load('en_core_web_sm')
# doc = nlp('John ate icecream and Peter ate apple')
# doc = nlp('Wall Street Journal just published an interesting piece on crypto currencies')
doc = nlp('A man with a knife and a boy hit the dazed shopkeeper on the head yesterday.')
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))
The nodes stand for the words in an utterance.
The links between the words represent dependency relations between pairs of words.
Relations may be typed (labeled) or not.
Each linguistic word is connected via a directed link.
The parse tree captures the (unidirectional) relationship between words and phrases.
Context-free grammars can be used to model various facts about the syntax of a language.
When paired with parsers, such grammars constitute a critical component in many applications.
Constituency is a key phenomenon easily captured with CFG rules.
Dependency parsing is based on words and their binary relations, and is easier to do than CFG parsing.
A dependency parse carries less information, but is sufficient for most applications.