I would like to find all pairs of data that appear together at least 10 times.
For example, given a large input file of keywords:
>>> foo, bar, spam, eggs, durian, stinky tofu, …
>>> fruit, meat, vinegar, sphere, foo, …
>>> merlot, red, hearty, spam, oranges, durian, …
>>> …
“durian” and “spam” appear together twice and all other combinations appear together only once.
Rather than brute-force counting over every possible keyword pair, we can use Python set operations to quickly determine how many lines two keywords appear on together.
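As a minimal sketch of the idea on the three example lines above (variable names are illustrative only), map each keyword to the set of line numbers it appears on; a single set intersection then gives the co-occurrence count:

# Toy version of the index: keyword -> set of line numbers it appears on
lines_with = {
    'durian': {1, 3},
    'spam':   {1, 3},
    'foo':    {1, 2},
}

shared = lines_with['durian'] & lines_with['spam']   # set intersection
print(len(shared))   # -> 2: "durian" and "spam" co-occur on lines 1 and 3

The full script below builds this index for an arbitrary input file and only compares keywords that appear on at least 10 lines.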
#!/usr/bin/env python
'''
Processes the given input and prints all pairs that appear together
on at least 10 different lines.

Algorithm explanation:
    Hash each keyword so that it maps to a set of row indexes (i.e.,
    the lines of the file it appears on).  Then, rather than comparing
    all possible keyword combinations, build a shorter list of
    candidate keywords that appear on 10 or more rows.  For each
    candidate pair, the set intersection contains every row that lists
    BOTH keywords; if the intersection holds at least 10 rows, output
    the pair.
'''
import codecs, csv, argparse, sys

def setIntersection(infile):
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)

    # Map each keyword to the set of line numbers it appears on.
    data_sets = {}
    i = 0
    for row in csv_data:
        i = i + 1
        for a in row:
            if a in data_sets:
                data_sets[a].add(i)
            else:
                data_sets[a] = set([i])
    datafile.close()

    # Only keywords that appear on at least 10 lines can form a
    # qualifying pair.
    candidates = {a: s for a, s in data_sets.items() if len(s) >= 10}

    # Check each candidate pair exactly once.
    pairc = set(candidates.keys())
    for a, s in candidates.items():
        pairc.remove(a)
        for c in pairc:
            common = s & candidates[c]
            if len(common) >= 10:
                print('%s, %s, %s' % (a, c, len(common)))

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--output-file', '-o', default='')
    p.add_argument('--bitmap', '-b', action="store_true")
    p.add_argument('files', nargs='+')
    args = p.parse_args()

    if args.output_file != '':
        sout = sys.stdout
        fh = open(args.output_file, 'w')
        sys.stdout = fh

    for infile in args.files:
        if args.bitmap:
            bitmapIntersection(infile)   # bitmap variant, defined below
        else:
            setIntersection(infile)

    if args.output_file != '':
        sys.stdout = sout
        fh.close()

if __name__ == '__main__':
    main()
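As a usage example (the script and file names here are placeholders; the flags follow the argparse definitions above):

python pair_counts.py -o pairs.txt keywords.csv

Each output line has the form keyword1, keyword2, count, and the optional -o flag redirects the results from stdout to the named file.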
Alternatively, we could represent each keyword as a bitmap and compute the Hamming weight of the bitwise AND, e.g.:
def bitmapIntersection(infile):
    '''
    The bitmap example is more general purpose and easily ported to C.
    In fact, the bitCount function is a Python version of computing the
    Hamming weight from K&R.  A simple example illustrates the bitmap:

        keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]
        keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]
        &        : [0 0 0 0 0 0 1 0 1 0 ... ]  <- bitwise-AND
    '''
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)

    # Map each keyword to an integer bitmap: bit i is set if the
    # keyword appears on line i (counting from 0).
    data_maps = {}
    i = 0
    for row in csv_data:
        for a in row:
            if a in data_maps:
                data_maps[a] |= (1 << i)
            else:
                data_maps[a] = (1 << i)
        i = i + 1
    datafile.close()

    # Only keywords that appear on at least 10 lines can qualify.
    candidates = {a: b for a, b in data_maps.items() if bitCount(b) >= 10}

    # Check each candidate pair exactly once.
    pairc = set(candidates.keys())
    for a, b in candidates.items():
        pairc.remove(a)
        for c in pairc:
            bits = bitCount(b & candidates[c])
            if bits >= 10:
                print('%s, %s, %s' % (a, c, bits))

def bitCount(int_type):
    ''' Bit-counting trick ported from Kernighan & Ritchie's
        "The C Programming Language": each iteration clears the
        lowest set bit. '''
    count = 0
    while int_type:
        int_type &= int_type - 1
        count += 1
    return count
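Tying the bitmap back to the toy example above (values shown are illustrative, with bit i corresponding to line i+1): "durian" and "spam" each appear on lines 1 and 3, so both bitmaps have the same two bits set, and the Hamming weight of their AND is the co-occurrence count:

durian = 0b101   # appears on lines 1 and 3
spam   = 0b101   # appears on lines 1 and 3
foo    = 0b011   # appears on lines 1 and 2

print(bitCount(durian & spam))   # -> 2 shared lines
print(bitCount(durian & foo))    # -> 1 shared line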
A bitmap comparison is functionally identical to the set intersection, but it opens the door to further optimizations, such as storing the bitmaps as encoded bit vectors with the BitVector module.
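A rough sketch of what that might look like with the third-party BitVector package (untested here; it assumes the package's BitVector(size=...) constructor, item assignment, bitwise & and count_bits() method, and that the total line count is known up front; the function and variable names are hypothetical):

# Hypothetical sketch, not a drop-in replacement for the script above.
from BitVector import BitVector

def bitvector_intersection(data_rows, num_lines, threshold=10):
    # Map each keyword to a BitVector with one bit per input line.
    vectors = {}
    for i, row in enumerate(data_rows):
        for a in row:
            if a not in vectors:
                vectors[a] = BitVector(size=num_lines)
            vectors[a][i] = 1

    # Keep only keywords that appear on at least `threshold` lines.
    candidates = {a: v for a, v in vectors.items()
                  if v.count_bits() >= threshold}

    # Check each candidate pair exactly once.
    remaining = set(candidates)
    for a, v in candidates.items():
        remaining.remove(a)
        for c in remaining:
            shared = (v & candidates[c]).count_bits()
            if shared >= threshold:
                print('%s, %s, %s' % (a, c, shared))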