Reading Text Files
June 09, 2000 | Fredrik Lundh
This is somewhat outdated, given the additions of xreadlines in 2.1 and text file iterators in 2.2. See the end of the page for examples.
This very brief note discusses a few more or less efficient ways to read lines of text from a file.
Doing it the usual way
The standard idiom consists of an ‘endless’ while loop, in which we repeatedly call the file’s readline method. Here’s an example:
# File: readline-example-1.py
file = open("sample.txt")
while 1:
    line = file.readline()
    if not line:
        break
    pass # do something
This snippet reads the file line by line. If readline reaches the end of the file, it returns an empty string. Otherwise, it returns the line of text, including the trailing newline character.
On my test machine, using a 10 megabyte sample text file, this script reads about 32,000 lines per second.
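The end-of-file behaviour of readline described above can be checked with a small sketch. It assumes a current Python 3 interpreter and writes its own throwaway file; the file contents and the variable names are made up for illustration. Note that a blank line comes back as a lone newline character, so only the real end of the file produces an empty string.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical contents).
# The last line deliberately has no trailing newline.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first\n\nthird")

results = []
with open(path) as f:
    while True:
        line = f.readline()
        if not line:          # empty string: end of file reached
            break
        results.append(line)  # the line, newline included if present
    at_eof = f.readline()     # calling readline again still returns ""

os.remove(path)

print(results)
print(repr(at_eof))
```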
Using the fileinput module
If you think the while loop is ugly, you can hide the readline call in a wrapper class. The standard fileinput module provides an input function which does exactly that.
# File: readline-example-2.py
import fileinput
for line in fileinput.input("sample.txt"):
    pass
However, adding more layers of Python code doesn’t exactly help. For the same test setup, performance drops to 13,000 lines per second. That’s nearly two and a half times slower!
Speeding up line reading
To speed things up, we obviously need to make sure we spend as little time in Python code (running under the interpreter) as possible.
One way to do this is to tell the file object to read larger chunks of data. For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method. Or you could even use the read method to read the entire file into a single memory block, and then use string.split to chop it up into individual lines.
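As a rough illustration of the two slurping approaches, here is a minimal sketch for a current Python 3 interpreter. It uses the split and splitlines string methods in place of the string.split function mentioned above, and the sample file contents are invented for the example.

```python
import os
import tempfile

# Throwaway sample file (hypothetical contents).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("alpha\nbeta\ngamma\n")

# Variant 1: readlines() returns a list of lines, newlines included.
with open(path) as f:
    lines_with_newlines = f.readlines()

# Variant 2: read the whole file into one string, then chop it up.
with open(path) as f:
    data = f.read()
lines_split = data.split("\n")   # newlines removed, but leaves a trailing "" entry
lines_clean = data.splitlines()  # newlines removed, no trailing entry

os.remove(path)

print(lines_with_newlines)
print(lines_split)
print(lines_clean)
```

Both variants hold the entire file in memory at once, so they only make sense when the file comfortably fits.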
However, if you’re processing really large files, it would be nice if you could limit the chunk size to something reasonable. For example, if you read a few thousand lines at a time, you probably won’t use up more than 100 kilobytes or so.
The following script uses a nested loop. The outer loop uses readlines to read about 100,000 bytes of text, and the inner loop processes those lines using a simple for-in loop:
# File: readline-example-3.py
file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass # do something
Can this really be faster? You bet. With the same test data, we can now process 96,900 lines of text per second!
Or to put it another way, this solution is three times as fast as the standard solution, and over seven times faster than the fileinput version.
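To repeat this kind of comparison on your own machine, something like the following sketch could be used. It is not the script behind the figures quoted here: the helper names, the generated sample file, and its size are all assumptions, and it targets a current Python 3 interpreter, where the relative speeds may look quite different.

```python
import fileinput
import os
import tempfile
import time

NUM_LINES = 20000  # hypothetical sample size, much smaller than 10 megabytes

def read_with_readline(path):
    """Count lines using the readline loop."""
    count = 0
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                break
            count += 1
    return count

def read_with_fileinput(path):
    """Count lines using the fileinput module."""
    count = 0
    with fileinput.input(files=(path,)) as lines:
        for line in lines:
            count += 1
    return count

def read_with_readlines_hint(path, hint=100000):
    """Count lines by reading chunks of roughly `hint` bytes."""
    count = 0
    with open(path) as f:
        while True:
            lines = f.readlines(hint)
            if not lines:
                break
            for line in lines:
                count += 1
    return count

def lines_per_second(func, path):
    """Return (line count, lines per second) for one run of func."""
    start = time.perf_counter()
    count = func(path)
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate

# Generate a throwaway sample file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(NUM_LINES):
        f.write("line %d of the sample file\n" % i)

results = {}
for func in (read_with_readline, read_with_fileinput, read_with_readlines_hint):
    results[func.__name__] = lines_per_second(func, path)

os.remove(path)

for name, (count, rate) in results.items():
    print("%-26s %8d lines %12.0f lines/sec" % (name, count, rate))
```

A single run on a small file is noisy, so it is worth repeating the measurement a few times and using a larger file before drawing conclusions.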
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better:
# File: readline-example-5.py
file = open("sample.txt")
for line in file:
    pass # do something
In Python 2.1, you have to use the xreadlines iterator factory instead:
# File: readline-example-4.py
file = open("sample.txt")
for line in file.xreadlines():
    pass # do something
Copyright © 2000 Fredrik Lundh