{"id":978,"date":"2019-05-25T22:11:07","date_gmt":"2019-05-25T22:11:07","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=978"},"modified":"2019-05-25T22:17:41","modified_gmt":"2019-05-25T22:17:41","slug":"iter_words","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/iter_words\/","title":{"rendered":"iter_words"},"content":{"rendered":"\n<p>I would like to iterate over a stream of words, say, from STDIN or a file (or any random input stream). Typically, this is done like this,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def iter_words(f):\n    for line in f:\n        for word in line.split():\n            yield word<\/pre>\n\n\n\n<p>And then one can simply,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">for word in iter_words(sys.stdin):\n    # do something<\/pre>\n\n\n\n<p>For a more concrete example, let&#8217;s say we need to keep a count of every unique word in an input stream, something like this,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">from collections import Counter\nc = Counter\n\nfor word in iter_words(sys.stdin):\n    c.update([word])<\/pre>\n\n\n\n<p>The only problem with this approach is that it will read data in  line-by-line, which in most cases is exactly what we want, however, in  some cases we don&#8217;t have line-breaks. 
Without line-breaks, the generator above will try to read the entire stream as a single line, so for extremely large data streams we will simply run out of memory.<\/p>\n\n\n\n<p>Instead, we can use the read() method to read one byte at a time and construct the words manually as we go, like this,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def iter_words(sfile):\n    chlist = []\n    for ch in iter(lambda: sfile.read(1), ''):\n        if ch.isspace():\n            if chlist:\n                yield ''.join(chlist)\n            chlist = []\n        else:\n            chlist.append(ch)\n    if chlist:\n        # Emit the final word if the stream does not end in whitespace.\n        yield ''.join(chlist)\n<\/pre>\n\n\n\n<p>This approach is memory efficient, but extremely slow. If you absolutely need the speed while still being memory efficient, you&#8217;ll have to do a buffered read, which is kind of an ugly hybrid of these two approaches.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def iter_words(sfile, buffer=1024):\n    leftover = ''\n    for chunk in iter(lambda: sfile.read(buffer), ''):\n        words = (leftover + chunk).split()\n        if words and not chunk[-1:].isspace():\n            # The chunk may have ended mid-word; carry the tail over.\n            leftover = words.pop()\n        else:\n            leftover = ''\n        for word in words:\n            yield word\n    if leftover:\n        yield leftover\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I would like to iterate over a stream of words, say, from STDIN or a file (or any random input stream). 
Typically, this is done like this, And then one can simply, For a more concrete example, let&#8217;s say we need to keep a count of every unique word in an input stream, something like [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/978"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=978"}],"version-history":[{"count":4,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/978\/revisions"}],"predecessor-version":[{"id":982,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/978\/revisions\/982"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=978"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=978"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=978"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}