I would like to find all pairs of data that appear together at least 10 times.
For example, given a large input file of keywords:
>>> foo, bar, spam, eggs, durian, stinky tofu, …
>>> fruit, meat, vinegar, sphere, foo, …
>>> merlot, red, hearty, spam, oranges, durian, …
>>> …
“durian” and “spam” appear together twice and all other combinations appear together only once.
Rather than brute-force counting over every possible keyword pair, we can use Python set operations to quickly determine how many lines two keywords appear on together.
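As a minimal sketch of the idea on the three example lines above (variable names are illustrative only), map each keyword to the set of line numbers it appears on; a single set intersection then gives the co-occurrence count:

# Toy version of the index: keyword -> set of line numbers it appears on
lines_with = {
    'durian': {1, 3},
    'spam':   {1, 3},
    'foo':    {1, 2},
}

shared = lines_with['durian'] & lines_with['spam']   # set intersection
print(len(shared))   # -> 2: "durian" and "spam" co-occur on lines 1 and 3

The full script below builds this index for an arbitrary input file and only compares keywords that appear on at least 10 lines.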
#!/usr/bin/env python
'''
Processes the given input and prints all pairs that appear together
on at least 10 different lines.

Algorithm explanation:
    Hash each keyword so that it maps to a set of row indexes (i.e.,
    the lines of the file it appears on).  Then, rather than comparing
    all possible keyword combinations, build a shorter list of
    candidate keywords that appear on 10 or more rows.  For each
    candidate pair, the set intersection contains every row that lists
    BOTH keywords; if the intersection holds at least 10 rows, output
    the pair.
'''
import codecs, csv, argparse, sys

def setIntersection(infile):
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)

    # Map each keyword to the set of line numbers it appears on.
    data_sets = {}
    i = 0
    for row in csv_data:
        i = i + 1
        for a in row:
            if a in data_sets:
                data_sets[a].add(i)
            else:
                data_sets[a] = set([i])
    datafile.close()

    # Only keywords that appear on at least 10 lines can form a
    # qualifying pair.
    candidates = {a: s for a, s in data_sets.items() if len(s) >= 10}

    # Check each candidate pair exactly once.
    pairc = set(candidates.keys())
    for a, s in candidates.items():
        pairc.remove(a)
        for c in pairc:
            common = s & candidates[c]
            if len(common) >= 10:
                print('%s, %s, %s' % (a, c, len(common)))

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--output-file', '-o', default='')
    p.add_argument('--bitmap', '-b', action="store_true")
    p.add_argument('files', nargs='+')
    args = p.parse_args()

    if args.output_file != '':
        sout = sys.stdout
        fh = open(args.output_file, 'w')
        sys.stdout = fh

    for infile in args.files:
        if args.bitmap:
            bitmapIntersection(infile)   # bitmap variant, defined below
        else:
            setIntersection(infile)

    if args.output_file != '':
        sys.stdout = sout
        fh.close()

if __name__ == '__main__':
    main()
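As a usage example (the script and file names here are placeholders; the flags follow the argparse definitions above):

python pair_counts.py -o pairs.txt keywords.csv

Each output line has the form keyword1, keyword2, count, and the optional -o flag redirects the results from stdout to the named file.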
Alternatively, we could represent each keyword as a bitmap and compute the Hamming weight of the bitwise AND, e.g.:
def bitmapIntersection(infile):
    '''
    The bitmap example is more general purpose and easily ported to C.
    In fact, the bitCount function is a Python version of computing the
    Hamming weight from K&R.  A simple example illustrates the bitmap:

        keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]
        keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]
        &        : [0 0 0 0 0 0 1 0 1 0 ... ]  <- bitwise-AND
    '''
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)

    # Map each keyword to an integer bitmap: bit i is set if the
    # keyword appears on line i (counting from 0).
    data_maps = {}
    i = 0
    for row in csv_data:
        for a in row:
            if a in data_maps:
                data_maps[a] |= (1 << i)
            else:
                data_maps[a] = (1 << i)
        i = i + 1
    datafile.close()

    # Only keywords that appear on at least 10 lines can qualify.
    candidates = {a: b for a, b in data_maps.items() if bitCount(b) >= 10}

    # Check each candidate pair exactly once.
    pairc = set(candidates.keys())
    for a, b in candidates.items():
        pairc.remove(a)
        for c in pairc:
            bits = bitCount(b & candidates[c])
            if bits >= 10:
                print('%s, %s, %s' % (a, c, bits))

def bitCount(int_type):
    ''' Bit-counting trick ported from Kernighan & Ritchie's
        "The C Programming Language": each iteration clears the
        lowest set bit. '''
    count = 0
    while int_type:
        int_type &= int_type - 1
        count += 1
    return count
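Tying the bitmap back to the toy example above (values shown are illustrative, with bit i corresponding to line i+1): "durian" and "spam" each appear on lines 1 and 3, so both bitmaps have the same two bits set, and the Hamming weight of their AND is the co-occurrence count:

durian = 0b101   # appears on lines 1 and 3
spam   = 0b101   # appears on lines 1 and 3
foo    = 0b011   # appears on lines 1 and 2

print(bitCount(durian & spam))   # -> 2 shared lines
print(bitCount(durian & foo))    # -> 1 shared line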
A bitmap comparison is functionally identical to the set intersection, but it opens the door to further optimizations, such as storing the bitmaps as encoded bit vectors with the BitVector module.
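A rough sketch of what that might look like with the third-party BitVector package (untested here; it assumes the package's BitVector(size=...) constructor, item assignment, bitwise & and count_bits() method, and that the total line count is known up front; the function and variable names are hypothetical):

# Hypothetical sketch, not a drop-in replacement for the script above.
from BitVector import BitVector

def bitvector_intersection(data_rows, num_lines, threshold=10):
    # Map each keyword to a BitVector with one bit per input line.
    vectors = {}
    for i, row in enumerate(data_rows):
        for a in row:
            if a not in vectors:
                vectors[a] = BitVector(size=num_lines)
            vectors[a][i] = 1

    # Keep only keywords that appear on at least `threshold` lines.
    candidates = {a: v for a, v in vectors.items()
                  if v.count_bits() >= threshold}

    # Check each candidate pair exactly once.
    remaining = set(candidates)
    for a, v in candidates.items():
        remaining.remove(a)
        for c in remaining:
            shared = (v & candidates[c]).count_bits()
            if shared >= threshold:
                print('%s, %s, %s' % (a, c, shared))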