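A k-skipgram is an ngram that is a superset of all ngrams and of each (k-i)-skipgram until (k-i)==0 (which includes 0-skip-grams). So how can these skipgrams be computed efficiently in Python?
Below is the code I tried, but it does not behave as expected: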
<pre>
input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
def find_skipgrams(input_list, N,K):
    bigram_list = []
    nlist=[]
    K=1
    for k in range(K+1):
        for i in range(len(input_list)-1):
            if i+k+1<len(input_list):
                nlist=[]
                for j in range(N+1):
                    if i+k+j+1<len(input_list):
                        nlist.append(input_list[i+k+j+1])
            bigram_list.append(nlist)
    return bigram_list
</pre>
The code above may not render properly here, but printing find_skipgrams(['all', 'this', 'happened', 'more', 'or', 'less'], 2, 1) gives the following output:
[['this', 'happened', 'more'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['happened', 'more', 'or'], ['more', 'or', 'less'], ['or', 'less'], ['less'], ['less']]
The code listed here does not give the correct output either: https://github.com/heaven00/skipgram/blob/master/skipgram.py
print skipgram_ndarray("What is your name") gives: ['What,is', 'is,your', 'your,name', 'name,', 'What,your', 'is,name']
name is a unigram!
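In the paper the OP links, the following string:
Insurgents killed in ongoing fighting
yields:
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.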
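The sets above follow from one reading of the definition: a k-skip-n-gram is any in-order selection of n tokens that skips at most k tokens in total. A minimal sketch under that assumption is shown here; the name `skipgrams_by_definition` is hypothetical and used only for illustration.

```python
from itertools import combinations

def skipgrams_by_definition(tokens, n, k):
    """Yield every in-order n-token tuple that skips at most k tokens in total."""
    tokens = list(tokens)
    for start in range(len(tokens) - n + 1):
        # The tuple begins at `start`; its other n-1 tokens are drawn from
        # the next n-1+k positions, which caps the total number of skips at k.
        window = tokens[start + 1 : start + n + k]
        for rest in combinations(window, n - 1):
            yield (tokens[start],) + rest

text = "Insurgents killed in ongoing fighting".split()
print(len(list(skipgrams_by_definition(text, 2, 2))))  # 9 two-skip bigrams
print(len(list(skipgrams_by_definition(text, 3, 2))))  # 10 two-skip trigrams
```

Slicing past the end of a list simply returns fewer items, so no special handling is needed near the end of the sentence.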
With a slight modification to NLTK's ngrams code (https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):
from itertools import chain, combinations

def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
    # Optionally pad either end with n-1 copies of pad_symbol.
    if pad_left:
        sequence = chain((pad_symbol,) * (n-1), sequence)
    if pad_right:
        sequence = chain(sequence, (pad_symbol,) * (n-1))
    return sequence

def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
    sequence_length = len(sequence)
    sequence = iter(sequence)
    sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)

    if sequence_length + pad_left + pad_right < k:
        raise Exception("The length of sentence + padding(s) < skip")

    if n < k:
        raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")

    history = []
    nk = n+k

    # Return point for recursion.
    if nk < 1:
        return
    # If n+k is longer than the sequence, reduce k by 1 and recur.
    elif nk > sequence_length:
        for ng in skipgrams(list(sequence), n, k-1):
            yield ng
        # The recursion has consumed the iterator, so stop here; otherwise
        # next() below raises StopIteration, which Python 3.7+ turns into
        # a RuntimeError inside a generator.
        return

    while nk > 1: # Collects the first instance of n+k length history
        history.append(next(sequence))
        nk -= 1

    # Iteratively drop the first item in history and pick up the next,
    # yielding skipgrams for each iteration.
    for item in sequence:
        history.append(item)
        current_token = history.pop(0)
        # Iterate through the rest of the history and
        # pick out all combinations of the (n-1)-grams.
        for idx in list(combinations(range(len(history)), n-1)):
            ng = [current_token]
            for _id in idx:
                ng.append(history[_id])
            yield tuple(ng)

    # Recursively yield the skipgrams for the rest of the sequence, where
    # len(sequence) < n+k
    for ng in list(skipgrams(history, n, k-1)):
        yield ng
Let's do some doctests to match the examples in the paper:
>>> text = "Insurgents killed in ongoing fighting".split()
>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
>>> two_skip_bigrams
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
>>> two_skip_trigrams
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
Note, however, that if n+k > len(sequence), it yields the same result as skipgrams(sequence, n, k-1) (this is not a bug, it's a fail-safe feature), e.g.
>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams = list(skipgrams(text, n=5, k=4))
>>>
>>> print(len(three_skip_trigrams), three_skip_trigrams)
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print(len(three_skip_fourgrams), three_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fourgrams), four_skip_fourgrams)
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print(len(four_skip_fivegrams), four_skip_fivegrams)
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
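The counts printed above (10, 5, 5 and 1) can be cross-checked without generating any tuples. The sketch below assumes the windowed reading of the definition, where each skipgram starts at one token and draws its remaining n-1 tokens from at most the next n-1+k positions; `count_skipgrams` is a hypothetical helper name, not part of NLTK.

```python
from math import comb  # available from Python 3.8

def count_skipgrams(length, n, k):
    """Count the k-skip-n-grams of a sequence with `length` tokens."""
    total = 0
    for start in range(length):
        # Tokens available after `start`, capped by the n-1+k window.
        context = min(n - 1 + k, length - 1 - start)
        # comb() returns 0 when fewer than n-1 tokens are available.
        total += comb(context, n - 1)
    return total

for n, k in [(3, 3), (4, 3), (4, 4), (5, 4)]:
    print(n, k, count_skipgrams(5, n, k))
```

Because the window is capped by the end of the sequence, raising k beyond len(sequence) - n cannot add new skipgrams, which is why the n=4 results for k=3 and k=4 coincide.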
n == k is allowed, but n < k is not, as these lines show:
if n < k:
    raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")
To make this easier to follow, let's unpack the "mystical" line:
for idx in list(combinations(range(len(history)), n-1)):
    pass # Do something
Given a list of unique items, combinations yields the following:
>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
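And since the indices of a list of tokens are always unique, e.g.
>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0) # i.e. 'this'
>>> list(range(len(sent)))
[0, 1, 2, 3]
it is possible to compute the possible combinations (without replacement) of the range: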
>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
If we map the indices back to the list of tokens:
>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
Then, concatenating with current_token, we get the skipgrams for the current token and its context + skip window:
>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
After that, we move on to the next word.
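The "take the current token, combine it with its window, then move on" loop can also be written over an arbitrary iterable, without calling len() first. The following is a rough sketch of that idea using collections.deque; the name `stream_skipgrams` is hypothetical, and it assumes n >= 1 and k >= 0.

```python
from collections import deque
from itertools import combinations

def stream_skipgrams(iterable, n, k):
    """Yield k-skip-n-grams from any iterable, one head token at a time."""
    size = n + k  # the head token plus its n-1+k context positions
    window = deque()
    for token in iterable:
        window.append(token)
        if len(window) == size:
            head = window.popleft()
            # Combine the head with every in-order choice of n-1 context tokens.
            for rest in combinations(window, n - 1):
                yield (head,) + rest
    # Flush: the remaining heads have a shorter context near the end.
    while window:
        head = window.popleft()
        for rest in combinations(window, n - 1):
            yield (head,) + rest

sent = ['this', 'is', 'a', 'foo', 'bar']
for gram in stream_skipgrams(iter(sent), 3, 2):
    print(gram)
```

The deque keeps popping from the left cheap, whereas list.pop(0) shifts every remaining element on each call.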