{"id":1010,"date":"2019-06-08T06:18:32","date_gmt":"2019-06-08T06:18:32","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=1010"},"modified":"2019-06-08T17:49:02","modified_gmt":"2019-06-08T17:49:02","slug":"trie-or-set","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/trie-or-set\/","title":{"rendered":"Trie or Set"},"content":{"rendered":"\n<p>Given a grid or input stream of characters, I would like to discover all words according to a given dictionary. This could be a dictionary of all English words or phrases (say, for an autocomplete service), or for any language. This is especially useful for languages where words are not clearly separated (e.g., Japanese, Chinese, Thai).<\/p>\n\n\n\n<p>Typically, this is done with a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Trie\" target=\"_blank\" rel=\"noopener noreferrer\">Trie<\/a> or a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Deterministic_acyclic_finite_state_automaton\" target=\"_blank\" rel=\"noopener noreferrer\">DAWG<\/a> (Directed Acyclic Word Graph). A Trie can be implemented in Python using a nested dict, i.e.,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def _make_trie(wordict):\n    trie = {}\n    for word in wordict:\n        current_trie = trie\n        for letter in word:\n            current_trie = current_trie.setdefault(letter, {})\n        current_trie['$'] = '$'\n    return trie\n\ndef _in_trie(trie, word):\n    ''' True IFF prefix or word in trie\n    '''\n    current_trie = trie\n    for letter in word:\n        if letter in current_trie:\n            current_trie = current_trie[letter]\n        else:\n            return False\n    return True<\/pre>\n\n\n\n<p>Using this approach, we can scan through a large stream of characters for potential words. 
Imagine a classic matching game where you are looking for words within a grid of characters. Programmatically, you would scan through the grid, testing combinations of characters. The advantage of a Trie (or DAWG) is that it allows for efficient pruning: if a character combination is not in the Trie, then you can abandon that search path entirely.<\/p>\n\n\n\n<p>An alternative approach is to create a Set of word prefixes, i.e.,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">almost_words = set()\nfor word in wordict:\n    for i in range(len(word)-1):\n        almost_words.add(word[:i+1])<\/pre>\n\n\n\n<p>If the dictionary contains [&#8216;apple&#8217;, &#8216;are&#8217;] then the Set <em>almost_words<\/em> would contain the following,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">{'a', 'ap', 'app', 'appl', 'ar'}<\/pre>\n\n\n\n<p>In other words, rather than test whether a character string exists in the Trie, one can simply check the Set <em>almost_words<\/em>. If there is no match then that particular path can be pruned. 
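To make the pruning test concrete, here is a small standalone sketch that builds the prefix Set for the same two-word dictionary; a single hash lookup then decides whether a path is worth extending:

```python
wordict = ['apple', 'are']

# Collect every proper prefix of every word in the dictionary
almost_words = set()
for word in wordict:
    for i in range(len(word) - 1):
        almost_words.add(word[:i + 1])

print(sorted(almost_words))   # ['a', 'ap', 'app', 'appl', 'ar']
print('app' in almost_words)  # True: keep extending this path
print('ax' in almost_words)   # False: prune this path
```

Note that the Set holds only proper prefixes, so a complete word such as 'apple' must still be checked against the dictionary itself.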
Here is a simple character scanner that uses this approach (the membership test against <em>wordict<\/em> assumes it supports fast lookups, e.g., a set or dict):<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def _setscan_rtl(grid, wordict):\n    ''' generator yielding dictionary words found in the grid\n    '''\n    almost_words = set()\n    maxlen = 0\n    for word in wordict:\n        maxlen = max(maxlen, len(word))\n        for i in range(len(word)-1):\n            almost_words.add(word[:i+1])\n    for line in grid:\n        for i in range(len(line)):\n            candidate_word = ''\n            for c in range(min(maxlen, len(line) - i)):\n                candidate_word += line[i+c]\n                if candidate_word in wordict:\n                    yield candidate_word\n                if candidate_word not in almost_words:\n                    # no word begins with this prefix, so prune\n                    break<\/pre>\n\n\n\n<p>I created a simple test case to determine whether a Set was truly faster, and whether it was as memory efficient. There was a noticeable increase in performance using a Set over a Trie (for both large and small data sets). 
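The benchmark itself lives in the linked repository; as a rough standalone sketch of the comparison, one can time a Set lookup against the equivalent nested-dict Trie walk (absolute numbers will vary by machine and dictionary size):

```python
import timeit

wordict = ['apple', 'are', 'art', 'apt']

# Prefix Set (proper prefixes only)
almost_words = {w[:i + 1] for w in wordict for i in range(len(w) - 1)}

# Equivalent nested-dict Trie
trie = {}
for w in wordict:
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})

def set_probe(s='app'):
    # one hash lookup for the whole prefix
    return s in almost_words

def trie_probe(s='app'):
    # one dict lookup per character
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return True

t_set = timeit.timeit(set_probe, number=200000)
t_trie = timeit.timeit(trie_probe, number=200000)
print('set: %.3fs  trie: %.3fs' % (t_set, t_trie))
```

The intuition is that the Set probes once per candidate string, while the Trie walk performs a dict lookup per character, so the Set tends to win on constant factors even though both are asymptotically similar.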
Interestingly, the performance difference was even more pronounced with Japanese characters, suggesting that language parsers can use a simple Set (or hash map) in place of a Trie or a DAWG. In the run below, the Set version completed in just under three minutes with a peak of roughly 507 MB resident, while the Trie version took over four minutes and roughly 681 MB.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">$ \/usr\/bin\/time .\/test_j_set.py\n50220\n177.84user 0.42system 2:58.54elapsed 99%CPU (0avgtext+0avgdata 507412maxresident)k\n0inputs+0outputs (0major+145801minor)pagefaults 0swaps\n\n$ \/usr\/bin\/time .\/test_j_trie.py\n50220\n250.44user 0.56system 4:11.86elapsed 99%CPU (0avgtext+0avgdata 680960maxresident)k\n0inputs+0outputs (0major+184571minor)pagefaults 0swaps<\/pre>\n\n\n\n<p>Full results and code are available on my <a rel=\"noopener noreferrer\" href=\"https:\/\/github.com\/timwarnock\/wordscanner.py\" target=\"_blank\">github<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Given a grid or input stream of characters, I would like to discover all words according to a given dictionary. This could be a dictionary of all English words or phrases (say, for an autocomplete service), or for any language. 
This is especially useful for languages where words are not clearly separated (e.g., Japanese, Chinese, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16,6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/1010"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=1010"}],"version-history":[{"count":5,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/1010\/revisions"}],"predecessor-version":[{"id":1015,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/1010\/revisions\/1015"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=1010"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=1010"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=1010"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}