I would like to iterate over a stream of words, say, from STDIN or a file (or any random input stream). Typically, this is done like this,
def iter_words(f):
    for line in f:
        for word in line.split():
            yield word
And then one can simply,
for word in iter_words(sys.stdin):
    ...  # do something with each word
For a more concrete example, let’s say we need to keep a count of every unique word in an input stream, something like this,
import sys
from collections import Counter

c = Counter()
for word in iter_words(sys.stdin):
    c.update([word])
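Once the stream is exhausted, the counter can be queried as usual; for example, printing the ten most frequent words (nothing here beyond the standard Counter API):

for word, count in c.most_common(10):
    print(count, word)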
The only problem with this approach is that it reads the data line by line. In most cases that is exactly what we want, but sometimes the input has no line breaks at all; the generator above would then have to hold the entire stream in memory as one giant "line", and for extremely large streams we will simply run out of memory.
Instead, we can use the read() method to read one byte at a time and construct the words manually as we go, like this,
def iter_words(sfile):
    chlist = []
    for ch in iter(lambda: sfile.read(1), ''):
        if ch.isspace():
            if chlist:
                yield ''.join(chlist)
                chlist = []
        else:
            chlist.append(ch)
    if chlist:
        # the stream ended without trailing whitespace; emit the final word
        yield ''.join(chlist)
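As a quick sketch of why this helps (using io.StringIO as a stand-in for a real stream, and assuming the byte-wise iter_words above), the input below contains no line breaks at all, yet it is consumed one byte at a time without ever holding the whole stream in memory:

import io

# a stream with no newlines anywhere; the line-based version would slurp
# it all in as one giant "line", the byte-wise version never does
stream = io.StringIO("alpha beta gamma " * 3)
print(list(iter_words(stream)))
# ['alpha', 'beta', 'gamma', 'alpha', 'beta', 'gamma', 'alpha', 'beta', 'gamma']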
This approach is memory-efficient but extremely slow, since it makes one read() call per byte. If you absolutely need the speed while still being memory-efficient, you'll have to do a buffered read, which is something of an ugly hybrid of the two approaches.
def iter_words(sfile, buffer=1024):
    for chunk in iter(lambda: sfile.read(buffer), ''):
        words = chunk.split()
        if not words:
            # the chunk was nothing but whitespace
            continue
        # everything except the last word is guaranteed to be complete
        for word in words[:-1]:
            yield word
        lastword = words[-1]
        # the chunk may have been cut off mid-word; if it did not end in
        # whitespace, keep reading one byte at a time until we hit
        # whitespace (or EOF) to finish the word
        if not chunk[-1].isspace():
            tail = []
            for ch in iter(lambda: sfile.read(1), ''):
                if ch.isspace():
                    break
                tail.append(ch)
            lastword += ''.join(tail)
        yield lastword
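As a rough sanity check (again with io.StringIO standing in for a real stream, and assuming the buffered iter_words above), a deliberately tiny buffer forces words to be split across chunk boundaries, and they should still come out whole:

import io

stream = io.StringIO("the quick brown fox jumps over the lazy dog")
# buffer=4 guarantees that most words straddle a chunk boundary
print(list(iter_words(stream, buffer=4)))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']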