在.txt文件中找到最常用单词的Python程序，必须打印单词及其计数

辛盛

2023-03-14

问题内容：

截至目前，我有一个函数替换countChars函数，

def countWords(lines):
  wordDict = {}
  for line in lines:
    wordList = lines.split()
    for word in wordList:
      if word in wordDict: wordDict[word] += 1
      else: wordDict[word] = 1
  return wordDict

但是当我运行该程序时，它会吐出这种可憎的东西（这只是一个例子，它旁边大约有两页单词，其中有一个很大的数字）

before 1478
battle-field 1478
as 1478
any 1478
altogether 1478
all 1478
ago 1478
advanced. 1478
add 1478
above 1478

虽然显然这意味着代码足够运行，但我并没有从中得到想要的东西。它需要打印每个单词在文件中的次数（gb.txt，这是葛底斯堡的地址）。显然，文件中的每个单词不在其中的次数恰好是1478次。

我在编程方面还很陌生，所以有点困惑。

from __future__ import division

inputFileName = 'gb.txt'

def readfile(fname):
  f = open(fname, 'r')
  s = f.read()
  f.close()
 return s.lower()

def countChars(t):
  charDict = {}
  for char in t:
    if char in charDict: charDict[char] += 1
    else: charDict[char] = 1
  return charDict

def findMostCommon(charDict):
  mostFreq = ''
  mostFreqCount = 0
  for k in charDict:
    if charDict[k] > mostFreqCount:
      mostFreqCount = charDict[k]
      mostFreq = k
  return mostFreq

def printCounts(charDict):
  for k in charDict:
    #First, handle some chars that don't show up very well when they print
    if k == '\n': print '\\n', charDict[k]  #newline
    elif k == ' ': print 'space', charDict[k]
    elif k == '\t': print '\\t', charDict[k] #tab
    else: print k, charDict[k]  #Normal character - print it with its count

def printAlphabetically(charDict):
  keyList = charDict.keys()
  keyList.sort()
  for k in keyList:
    #First, handle some chars that don't show up very well when they print
    if k == '\n': print '\\n', charDict[k]  #newline
    elif k == ' ': print 'space', charDict[k]
    elif k == '\t': print '\\t', charDict[k] #tab
    else: print k, charDict[k]  #Normal character - print it with its count

def printByFreq(charDict):
  aList = []
  for k in charDict:
    aList.append([charDict[k], k])
  aList.sort()     #Sort into ascending order
  aList.reverse()  #Put in descending order
  for item in aList:
    #First, handle some chars that don't show up very well when they print
    if item[1] == '\n': print '\\n', item[0]  #newline
    elif item[1] == ' ': print 'space', item[0]
    elif item[1] == '\t': print '\\t', item[0] #tab
    else: print item[1], item[0]  #Normal character - print it with its count

def main():
  text = readfile(inputFileName)
  charCounts = countChars(text)
  mostCommon = findMostCommon(charCounts)
  #print mostCommon + ':', charCounts[mostCommon]
  #printCounts(charCounts)
  #printAlphabetically(charCounts)
  printByFreq(charCounts)

main()

问题答案：

如果您需要计算段落中的单词数，则最好使用正则表达式。

让我们从一个简单的例子开始：

import re

my_string = "Wow! Is this true? Really!?!? This is crazy!"

words = re.findall(r'\w+', my_string) #This finds words in the document

结果：

>>> words
['Wow', 'Is', 'this', 'true', 'Really', 'This', 'is', 'crazy']

请注意，“是”和“是”是两个不同的词。我的猜测是，您希望对它们进行相同的计数，因此我们可以将所有单词大写，然后对其进行计数。

from collections import Counter

cap_words = [word.upper() for word in words] #capitalizes all the words

word_counts = Counter(cap_words) #counts the number each time a word appears

结果：

>>> word_counts
Counter({'THIS': 2, 'IS': 2, 'CRAZY': 1, 'WOW': 1, 'TRUE': 1, 'REALLY': 1})

你准备好了吗？

现在，我们需要做的与上次完全相同，只是这次我们正在读取文件。

import re
from collections import Counter

with open('your_file.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word.upper() for word in words]

word_counts = Counter(cap_words)

在.txt文件中找到最常用单词的Python程序，必须打印单词及其计数

相关阅读

相关文章

相关问答

相关工具

相关文档