{"id":548,"date":"2011-11-07T04:32:34","date_gmt":"2011-11-07T04:32:34","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=548"},"modified":"2012-12-25T22:38:42","modified_gmt":"2012-12-25T22:38:42","slug":"python-finding-recurring-pairs-of-data","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/python-finding-recurring-pairs-of-data\/","title":{"rendered":"python, finding recurring pairs of data"},"content":{"rendered":"<p>I would like to find all pairs of data that appear together at least 10 times.<\/p>\n<p>For example, given a large input file of keywords:<\/p>\n<p>>>> foo, bar, spam, eggs, durian, stinky tofu, &#8230;<br \/>\n>>> fruit, meat, vinegar, sphere, foo, &#8230;<br \/>\n>>> merlot, red, hearty, spam, oranges, durian, &#8230;<br \/>\n>>> &#8230;<\/p>\n<p>&#8220;durian&#8221; and &#8220;spam&#8221; appear together twice and all other combinations appear together only once.<\/p>\n<p>Rather than a brute-force approach counting through all possible keyword pairs, we can use python set operations to quickly determine the number of times two keywords appeared together.<\/p>\n<pre class=\"sh_python\">\r\n#!\/usr\/bin\/env python\r\n'''\r\nProcesses the given input and returns all pairs that appear together in \r\nat least 10 different lines.\r\n\r\nAlgorithm Explaination:\r\n\r\nHash each value such that each value maps a 'set' of row indexes \r\n(i.e., each line in the file)\r\n\r\nNext, rather than compare all possible combinations of values, create a \r\nshorter list of candidate values who have 10 or more row indexes.\r\n\r\nThe set-intersection will contain all rows that BOTH values were listed in, \r\nif the size of the set-intersection is greater than 10, output this pair\r\n'''\r\n\r\nimport codecs, csv, argparse, sys\r\n\r\ndef setIntersection(infile):\r\n    datafile = codecs.open(infile, 'r', 'utf-8')\r\n    csv_data = csv.reader(datafile)\r\n    data_sets = {}\r\n    i = 0\r\n    for row in csv_data:\r\n        i = i+1\r\n        for a in row:\r\n            if a in data_sets:\r\n                data_sets[a].add(i)\r\n            else:\r\n                data_sets[a] = set([i])\r\n    candidates = {a:s for a,s in data_sets.items() if len(s) > 10}\r\n    pairc = {a for a in candidates.keys()}\r\n    for a,s in candidates.items():\r\n        pairc.remove(a)\r\n        for c in pairc:\r\n            if len(s & candidates[c]) > 10:\r\n                print('%s, %s, %s' % (a,c,len(s & candidates[c])))\r\n\r\ndef main():\r\n    p = argparse.ArgumentParser()\r\n    p.add_argument('--output-file', '-o', default='')\r\n    p.add_argument('--bitmap', '-b', action=\"store_true\")\r\n    p.add_argument('files', nargs='+')\r\n    args = p.parse_args()\r\n\r\n    if (args.output_file != ''):\r\n        sout = sys.stdout\r\n        fh = open(options.output_file, 'w')\r\n        sys.stdout = fh\r\n\r\n    for infile in args.files:\r\n        setIntersection(infile)\r\n\r\n    if (args.output_file != ''):\r\n        sys.stdout = sout\r\n        fh.close()\r\n\r\nif __name__ == '__main__':\r\n    main()\r\n<\/pre>\n<p>Alternatively, we could use a bitmap comparison and compute the Hamming Weight, e.g.,<\/p>\n<pre class=\"sh_python\">\r\ndef bitmapIntersection(infile):\r\n    '''\r\n    The bitmap example is more general purpose and easily ported to C.  In fact, \r\n    the bitCount function is a python version of computing Hamming Weight from K&R.\r\n\r\n    A simple example illustrates the bitmap,\r\n\r\n      keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]\r\n      keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]\r\n             & : [0 0 0 0 0 0 1 0 1 0 ... ] <- bitwise-AND\r\n    '''\r\n    datafile = codecs.open(infile, 'r', 'utf-8')\r\n    csv_data = csv.reader(datafile)\r\n    data_maps = {}\r\n    i = 0\r\n    for row in csv_data:\r\n        for a in row:\r\n            if a in data_maps:\r\n                data_maps[a] += (2 ** i)\r\n            else:\r\n                data_maps[a] = (2 ** i)\r\n        i = i+1\r\n    candidates = {a:b for a,b in data_maps.items() if bitCount(b) > 10}\r\n    pairc = {a for a in candidates.keys()}\r\n    for a,b in candidates.items():\r\n        pairc.remove(a)\r\n        for c in pairc:\r\n            bits = bitCount(b & candidates[c])\r\n            if bits > 10:\r\n                print('%s, %s, %s' % (a,c,bits))\r\n\r\ndef bitCount(int_type):\r\n    ''' ported from Kernighan & Ritchie's \"The C Programming Language\"\r\n    '''\r\n    count = 0\r\n    while(int_type):\r\n        int_type &= int_type - 1\r\n        count += 1\r\n    return(count)\r\n<\/pre>\n<p>A bitmap comparison is functionally identical to the set intersection but provides potential optimization enhancements such as using encoded vectors like the <a href=\"http:\/\/pypi.python.org\/pypi\/BitVector\/3.0\">BitVector<\/a> module.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I would like to find all pairs of data that appear together at least 10 times. For example, given a large input file of keywords: >>> foo, bar, spam, eggs, durian, stinky tofu, &#8230; >>> fruit, meat, vinegar, sphere, foo, &#8230; >>> merlot, red, hearty, spam, oranges, durian, &#8230; >>> &#8230; &#8220;durian&#8221; and &#8220;spam&#8221; appear [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/548"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=548"}],"version-history":[{"count":5,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/548\/revisions"}],"predecessor-version":[{"id":706,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/548\/revisions\/706"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=548"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=548"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=548"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}