
Comparing files with Python difflib

仰雅昶
2023-03-14
Question

I am trying to use difflib to generate a diff of two text files that contain tweets. Here is the code:

#!/usr/bin/env python

# difflib_test

import difflib

file1 = open('/home/saad/Code/test/new_tweets', 'r')
file2 = open('/home/saad/PTITVProgs', 'r')

diff = difflib.context_diff(file1.readlines(), file2.readlines())
delta = ''.join(diff)
print(delta)

This is the PTITVProgs text file:

Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar  at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar  at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI
@ArifAlvi
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI
Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI
Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI

This is the new_tweets text file:

Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI
Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI
@ImranKhanPTI
Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI
Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar  at 8PM on ARY News. Rgds #PTI
Watch PTI on April 6th (5) @Asad_Umar  at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Asad Umar at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
Watch PTI on April 6th (5) Waleed Iqbal at 8PM on Channel 5. Rgds #PTI
Watch PTI on April 6th (3) Dr Israr Shah at 10PM on PTV News. Rgds #PTI
Watch PTI on April 6th (4) Javed hashmi at 1PM on PTV News. Rgds #PTI
Watch PTI on April 6th (3) Imran Alvi at 1PM on AAJ News. Rgds #PTI
Watch PTI on April 6th (1) Dr @ArifAlvi, Andleeb Abbas and Ehtisham Ameer at 11PM on ARY News (2) Hamid Khan at 10PM on ATV. Rgds #PTI
Watch PTI on April 5th (1) Farooq Amjad Meer at 10:45PM on Dunya News. Rgds #PTI
Watch PTI on April 4th (4) Faisal Khan at 8PM on PTV News. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (3) Faisal Khan at 11PM on ATV. Rgds #PTI
@FaisalJavedKhan
Watch PTI on April 4th (1) Dr Israr Shah at 8PM on Waqt News (2) Dr Arif Alvi at 9PM on PTV World. Rgds #PTI
@ArifAlvi
Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI

This is the diff I get from the program:

*** 
--- 
***************
*** 1,7 ****
- Watch PTI on April 7th (3) Malaika Reza at 8PM on AAJ News (4) Shah Mehmood Qureshi at 8PM on Geo News. Rgds #PTI
- Watch PTI on April 7th (2) Chairman IMRAN KHAN at 10PM on PTV News in News Night with Sadia Afzal, Rpt: 2AM, 2PM. Rgds #PTI
- @ImranKhanPTI
- Watch PTI on April 7th (1) Dr Waseem Shahzad NOW at 6PM on PTV News. Rgds #PTI
  Watch PTI on April 6th (7) Dr Israr Shah at 10PM on Business Plus in "Talking Policy". Rgds #PTI
  CORRECTION!! Watch PTI on April 6th (5) @Asad_Umar  at 8PM on ARY News. Rgds #PTI
  Watch PTI on April 6th (5) @Asad_Umar  at 8PM on AAJ News (6) PTI vs PMLN at 8PM on NewsOne. Rgds #PTI
--- 1,3 ----
***************
*** 21,24 ****
  Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
  Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
  Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI--- 17,23 ----
  Watch PTI on April 3rd (12) Abrar ul Haq on 10PM on Dawn News (13) Shabbir Sial at 10PM on Channel5. Rgds #PTI
  Watch PTI on April 3rd (11) Sadaqat Abbasi on 8PM on RohiTV. Rgds #PTI
  Watch PTI on April 3rd (10) Dr Zarqa and Andleeb Abbas on 8PM on Waqt News. Rgds #PTI
! Watch PTI on April 3rd (9) Fauzia Kasuri at 8PM on Din News. Rgds #PTI
! Watch PTI on April 3rd (8) Mehmood Rasheed at 8PM on ARY News. Rgds #PTI
! Watch PTI on April 3rd (7) Israr Abbasi (Repeat on Arp 4th) at 1:20AM and 1PM on Vibe TV. Rgds #PTI
! Watch PTI on April 3rd (5) Rao Fahad at 9PM on Express News (6) Dr Seems Zia at 10:30PM on Health TV. Rgds #PTI

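As a quick comparison of the two source files (PTITVProgs and new_tweets) shows, the difference between them is 3 tweets from April 7th and 3 tweets from April 3rd.

I only want the lines that are in new_tweets but not in PTITVProgs to appear in the diff.

Instead, this prints a lot of text I don't want to see. I also don't know what the *** 1,7*** and *** 1,3*** markers in the diff output stand for.
What is the correct way to get just the changed lines?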


Answer:

Just parse the output of the diff like this (change the '-' to '+' if needed):

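#!/usr/bin/env python

# difflib_test

import difflib

# Read both files fully; the with statement closes them afterwards.
with open('/home/saad/Code/test/new_tweets', 'r') as file1, \
     open('/home/saad/PTITVProgs', 'r') as file2:
    new_lines = file1.readlines()
    old_lines = file2.readlines()

# ndiff prefixes every line with a two-character code;
# '- ' marks lines found only in the first sequence (new_tweets).
diff = difflib.ndiff(new_lines, old_lines)
delta = ''.join(x[2:] for x in diff if x.startswith('- '))
print(delta)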
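
To try the same filter without the files from the question, here is a minimal sketch that runs on in-memory lists. The helper name `lines_only_in_first` and the sample tweets are made up for illustration; only `difflib.ndiff` and its two-character line prefixes (`'- '`, `'+ '`, `'  '`, `'? '`) come from the standard library.

```python
import difflib


def lines_only_in_first(first, second):
    """Return the lines that ndiff marks with '- ', i.e. lines present
    in `first` but not in `second`, with the two-character prefix removed."""
    diff = difflib.ndiff(first, second)
    return [line[2:] for line in diff if line.startswith('- ')]


# Hypothetical sample data standing in for new_tweets and PTITVProgs.
new_tweets = [
    "tweet C\n",
    "tweet B\n",
    "tweet A\n",
]
old_tweets = [
    "tweet B\n",
    "tweet A\n",
]

print(''.join(lines_only_in_first(new_tweets, old_tweets)), end='')
```

Swapping the `'- '` test for `'+ '` would instead collect the lines that appear only in the second sequence. Note that when a line was edited rather than added or removed, ndiff reports its old form with `'- '` and its new form with `'+ '`, so edited lines show up in both filters.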


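As for the markers the question asks about: in the context diff format, a line such as `*** 1,7 ****` introduces lines 1 to 7 of the first file, and a line such as `--- 1,3 ----` introduces the corresponding lines 1 to 3 of the second file. Within a block, `- ` marks a line removed from the first file, `+ ` a line added in the second, `! ` a changed line, and two spaces mark unchanged context. The small sketch below, using made-up one-letter lines and the hypothetical file labels "old" and "new", produces such markers so they can be inspected on a tiny input.

```python
import difflib

# Hypothetical inputs: the first sequence has one extra leading line.
old = ["a\n", "b\n", "c\n", "d\n"]
new = ["b\n", "c\n", "d\n"]

# context_diff yields the diff one line at a time; collect it in a list.
diff_lines = list(difflib.context_diff(old, new, fromfile="old", tofile="new"))

# The range headers say which line numbers of each input the block covers.
print(''.join(diff_lines), end='')
```

Because the only change here is a removal, the block for the second file consists of its range header alone, which is also why `--- 1,3 ----` in the question's output is followed directly by the next separator.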