iter_words

I would like to iterate over a stream of words, say from STDIN or a file (or any arbitrary input stream). Typically, this is done like this,

def iter_words(f):
    for line in f:
        for word in line.split():
            yield word

And then one can simply,

import sys

for word in iter_words(sys.stdin):
    pass  # do something with each word

For a more concrete example, let’s say we need to keep a count of every unique word in an input stream, something like this,

import sys
from collections import Counter

c = Counter()

for word in iter_words(sys.stdin):
    c.update([word])
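
Since a Counter can consume any iterable directly, we can also skip the explicit loop (and the one-element list on every update) entirely:

c = Counter(iter_words(sys.stdin))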

The only problem with this approach is that it reads the data in line by line. In most cases that is exactly what we want, but some inputs have no line breaks at all, and then the whole stream gets buffered as a single enormous "line". For extremely large data streams, the generator above will simply run out of memory.
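
To see the failure mode, here is a small sketch, using io.StringIO to stand in for a real stream; a newline-free input comes back from line iteration as one string the size of the entire stream:

import io

# ~5 MB of words with no newline anywhere
data = io.StringIO('word ' * 1_000_000)

for line in data:
    # the very first "line" is the whole 5 MB stream, held in memory at once
    print(len(line))  # 5000000
    break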

Instead, we can use the read() method to read one character at a time, manually constructing the words as we go, like this,

def iter_words(sfile):
    chlist = []
    # read() returns '' at EOF, which is the sentinel that stops iter()
    for ch in iter(lambda: sfile.read(1), ''):
        if ch.isspace():
            if chlist:
                yield ''.join(chlist)
            chlist = []
        else:
            chlist.append(ch)
    # don't lose the final word if the stream doesn't end in whitespace
    if chlist:
        yield ''.join(chlist)
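
As a quick sanity check, this version handles a stream with no newlines at all (again using io.StringIO as a stand-in for a file):

import io

print(list(iter_words(io.StringIO('no newlines in here at all'))))
# ['no', 'newlines', 'in', 'here', 'at', 'all']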

This approach is memory efficient, but extremely slow: it makes a Python-level function call (and allocates a one-character string) for every single character of input. If you absolutely need the speed while still being memory efficient, you'll have to do a buffered read, which is kind of an ugly hybrid of the two approaches.

def iter_words(sfile, buffer=1024):
    partial = ''
    for chunk in iter(lambda: sfile.read(buffer), ''):
        words = (partial + chunk).split()
        # if the chunk ends mid-word, the last token may be incomplete;
        # hold it back and prepend it to the next chunk
        if not chunk[-1].isspace():
            partial = words.pop()
        else:
            partial = ''
        for word in words:
            yield word
    # the stream may end without trailing whitespace
    if partial:
        yield partial
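
A quick way to convince yourself the chunk-boundary handling is correct is to compare it against a plain split() across several buffer sizes, including pathological ones; a small sketch:

import io

text = 'the quick brown fox jumps over the lazy dog'
for size in (1, 2, 3, 7, 1024):
    assert list(iter_words(io.StringIO(text), buffer=size)) == text.split()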