I would like to find all pairs of data that appear together at least 10 times.
For example, given a large input file of keywords:
>>> foo, bar, spam, eggs, durian, stinky tofu, …
>>> fruit, meat, vinegar, sphere, foo, …
>>> merlot, red, hearty, spam, oranges, durian, …
>>> …
“durian” and “spam” appear together twice; every other combination appears together only once.
Rather than brute-forcing a count over every possible keyword pair, we can use Python set operations to quickly determine how many times two keywords appeared together.
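For instance, using the sample input above (a minimal sketch with the line numbers written out by hand):

# durian appears on lines 1 and 3; spam also appears on lines 1 and 3
rows_with_durian = {1, 3}
rows_with_spam = {1, 3}
# the intersection holds the lines containing both keywords; its size is the co-occurrence count
print(len(rows_with_durian & rows_with_spam))   # prints 2

The full script: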
#!/usr/bin/env python
'''
Processes the given input and returns all pairs that appear together in
at least 10 different lines.

Algorithm explanation:
Hash each value such that each value maps to a 'set' of row indexes
(i.e., the lines in the file on which it appears).
Next, rather than comparing all possible combinations of values, build a
shorter list of candidate values that have 10 or more row indexes.
The set intersection contains all rows that BOTH values were listed in;
if the size of the set intersection is at least 10, output the pair.
'''
import codecs, csv, argparse, sys

def setIntersection(infile):
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_sets = {}
    i = 0
    for row in csv_data:
        i = i + 1
        for a in row:
            if a in data_sets:
                data_sets[a].add(i)        # record that value 'a' appears on line i
            else:
                data_sets[a] = set([i])
    datafile.close()
    # only values appearing on at least 10 lines can belong to a qualifying pair
    candidates = {a: s for a, s in data_sets.items() if len(s) >= 10}
    pairc = set(candidates.keys())
    for a, s in candidates.items():
        pairc.remove(a)                    # so each unordered pair is compared only once
        for c in pairc:
            if len(s & candidates[c]) >= 10:
                print('%s, %s, %s' % (a, c, len(s & candidates[c])))

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--output-file', '-o', default='')
    p.add_argument('--bitmap', '-b', action="store_true")  # unused here; see the bitmap alternative below
    p.add_argument('files', nargs='+')
    args = p.parse_args()
    if args.output_file != '':
        sout = sys.stdout
        fh = open(args.output_file, 'w')
        sys.stdout = fh
    for infile in args.files:
        setIntersection(infile)
    if args.output_file != '':
        sys.stdout = sout
        fh.close()

if __name__ == '__main__':
    main()
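Assuming the script is saved as, say, pair_counts.py (the name is arbitrary), it can be run against one or more CSV files; each qualifying pair is printed as 'keyword1, keyword2, count', either to stdout or to the file given with -o:

$ python pair_counts.py keywords.csv
$ python pair_counts.py -o results.txt part1.csv part2.csv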
Alternatively, we could represent each keyword's lines as a bitmap and compute the Hamming weight of the bitwise AND, e.g.,
def bitmapIntersection(infile):
    '''
    The bitmap example is more general purpose and easily ported to C. In fact,
    the bitCount function is a Python version of computing the Hamming weight
    as in K&R.

    A simple example illustrates the bitmap:
    keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]
    keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]
    &        : [0 0 0 0 0 0 1 0 1 0 ... ]  <- bitwise AND
    '''
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_maps = {}
    i = 0
    for row in csv_data:
        for a in row:
            if a in data_maps:
                data_maps[a] |= (2 ** i)   # set bit i (OR, so a duplicate on the same line is harmless)
            else:
                data_maps[a] = (2 ** i)
        i = i + 1
    datafile.close()
    # only values appearing on at least 10 lines can belong to a qualifying pair
    candidates = {a: b for a, b in data_maps.items() if bitCount(b) >= 10}
    pairc = set(candidates.keys())
    for a, b in candidates.items():
        pairc.remove(a)                    # so each unordered pair is compared only once
        for c in pairc:
            bits = bitCount(b & candidates[c])
            if bits >= 10:
                print('%s, %s, %s' % (a, c, bits))

def bitCount(int_type):
    ''' Counts set bits by repeatedly clearing the lowest set bit,
        ported from Kernighan & Ritchie's "The C Programming Language".
    '''
    count = 0
    while int_type:
        int_type &= int_type - 1
        count += 1
    return count
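To see why bitCount needs exactly one iteration per set bit, note that int_type & (int_type - 1) clears the lowest set bit. A quick check with the function above:

# 0b10110 has three set bits; each iteration clears the lowest one:
#   0b10110 & 0b10101 = 0b10100
#   0b10100 & 0b10011 = 0b10000
#   0b10000 & 0b01111 = 0b00000   (loop ran 3 times)
print(bitCount(0b10110))   # 3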
A bitmap comparison is functionally equivalent to the set intersection, but it opens the door to further optimizations, such as storing the bitmaps as encoded bit vectors with the BitVector module.
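For example, a sketch using the third-party BitVector package (assuming it is installed; the bitmaps are built by hand here to mirror the illustration in the docstring above, rather than read from a file):

from BitVector import BitVector

# one bit per line; bit i is set if the keyword appears on line i
keyword1 = BitVector(size=10)
keyword2 = BitVector(size=10)
keyword1[6] = 1; keyword1[8] = 1
keyword2[1] = 1; keyword2[2] = 1; keyword2[6] = 1; keyword2[7] = 1; keyword2[8] = 1

both = keyword1 & keyword2        # bitwise AND of the two bit vectors
print(both.count_bits())          # number of lines containing both keywords -> 2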