{"id":529,"date":"2011-10-17T17:36:20","date_gmt":"2011-10-17T17:36:20","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=529"},"modified":"2012-12-25T22:39:24","modified_gmt":"2012-12-25T22:39:24","slug":"python-unique-files-by-content","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/python-unique-files-by-content\/","title":{"rendered":"python, unique files by content"},"content":{"rendered":"<p>I would like to retrieve a list of unique files by content rather than by filename.<\/p>\n<p>That is, if spam.txt and eggs.txt both contain the same contents, I want only one of them returned. A simple approach is to compute a SHA-1 checksum of each file and build a dictionary keyed on the checksum.<\/p>\n<pre class=\"sh_python\">\r\n#!\/usr\/bin\/env python\r\n# vim: set tabstop=4 shiftwidth=4 autoindent smartindent:\r\nimport hashlib, sys\r\nimport logging\r\n\r\ndef _dupdoc(filelist):\r\n\t'''\r\n\treturns a list of unique files (by content rather than filename)\r\n\tthat is, if spam.txt and eggs.txt both contain the same contents,\r\n\tonly one filename will be returned\r\n\t'''\r\n\tshasums = {}\r\n\tfor file in filelist:\r\n\t\ttry:\r\n\t\t\t# 'with' closes the file even if read() fails\r\n\t\t\twith open(file, 'rb') as fh:\r\n\t\t\t\tsha1 = hashlib.sha1(fh.read()).hexdigest()\r\n\t\t\t# keep only the first filename seen for each checksum\r\n\t\t\tif sha1 not in shasums:\r\n\t\t\t\tshasums[sha1] = file\r\n\t\t\t\tlogging.debug('%s %s' %(file, sha1))\r\n\t\texcept IOError as e:\r\n\t\t\tlogging.warning('could not open %s' %(file))\r\n\tuniquelist = list(shasums.values())\r\n\treturn uniquelist\r\n\r\n\r\nif __name__ == \"__main__\":\r\n\t'''\r\n\tcommand-line, accept either a list of files in STDIN\r\n\tor a single filename argument that contains a list of files\r\n\t'''\r\n\r\n\tfilelist = []\r\n\tif len(sys.argv) > 1:\r\n\t\tfh = open(sys.argv[1], 'r')\r\n\t\tfilelist = fh.readlines()\r\n\t\tfh.close()\r\n\telse:\r\n\t\tfilelist = sys.stdin.readlines()\r\n\tfilelist = [file.strip() for file in filelist]\r\n\tuniques = 
_dupdoc(filelist)\r\n\tfor file in uniques:\r\n\t\tprint(file)\r\n\r\n<\/pre>\n<p>The command-line __main__ portion of the program accepts an optional argument naming a file that contains the list of files; if no argument is given, the file list is read from STDIN, e.g.,<\/p>\n<pre>\r\n#  find test -type f | dupdoc\r\ntest\/spam1.txt\r\ntest\/spam9.txt\r\n# \r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I would like to retrieve a list of unique files by content rather than by filename. That is, if spam.txt and eggs.txt both contain the same contents, I want only one of them returned. A simple approach is to compute a SHA-1 checksum of each file and build a dictionary keyed on the checksum. [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[6,14],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/529"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=529"}],"version-history":[{"count":5,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/529\/revisions"}],"predecessor-version":[{"id":709,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/529\/revisions\/709"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=529"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=529"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=529"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}