{"id":535,"date":"2011-11-02T16:35:29","date_gmt":"2011-11-02T16:35:29","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=535"},"modified":"2012-12-25T22:39:21","modified_gmt":"2012-12-25T22:39:21","slug":"python-analyzing-csv-files-part-1","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/python-analyzing-csv-files-part-1\/","title":{"rendered":"python, analyzing csv files, part 1"},"content":{"rendered":"<p>I would like to analyze a collection of CSV (comma-separated-values) files in python. Ideally, I would like to treat the csv data as a native python object.<\/p>\n<p>For example,<\/p>\n<pre class=\"sh_python\">\r\n>>> financial_detail = Report('financial-detail.csv')\r\n>>> transactions = {}\r\n>>> for row in financial_detail:\r\n...   transactions.append(row['Transaction'])\r\n...\r\n>>> financial_detail.sum('Tax Amount')\r\nDecimal('123456.10')\r\n>>> \r\n<\/pre>\n<p>Additionally, I would like to easily retrieve row data by column name, e.g.,<\/p>\n<pre class=\"sh_python\">\r\n>>> financial_detail = Report('financial-detail.csv')\r\n>>> \r\n>>> financial_detail[0]['Transaction']\r\n'6324565'\r\n>>> \r\n>>> ## the above should be equivalent to\r\n>>> financial_detail.headers.index('Transaction')\r\n7\r\n>>> financial_detail[0][7]\r\n'6324565'\r\n>>> \r\n<\/pre>\n<p>Given a directory of CSV files, I would like to retrieve these files (magically) by name, e.g.,<\/p>\n<pre class=\"sh_python\">\r\n>>> reports = Reports('\/directory\/of\/reports\/')\r\n>>> len(reports.sales_tax)\r\n4884\r\n>>> \r\n>>> len(reports.finance_detail)\r\n512916\r\n>>> reports.sales_tax.sum('Amount') - reports.financial_detail.sum('Tax Amount')\r\nDecimal('0.0')\r\n>>> \r\n<\/pre>\n<p>This can be accomplished by leveraging the python builtin csv module, the collections.Mapping abstract base class, as well as the python magic __getattr__ methods, i.e.,<\/p>\n<pre class=\"sh_python\">\r\n!\/usr\/bin\/env python\r\n# vim: set tabstop=4 shiftwidth=4 autoindent smartindent:\r\nimport csv, glob, os, sys \r\nimport collections\r\nfrom decimal import Decimal\r\n\r\n## set the logging level\r\nimport logging\r\nlogging.basicConfig(\r\n  level=logging.DEBUG,\r\n  format='%(levelname)s:%(filename)s:%(lineno)d -- %(message)s'\r\n)\r\n\r\n\r\n\r\nclass Report(collections.Mapping):\r\n    ''' \r\n    Interface into a CSV Report\r\n\r\n    Parses the csv into an iterable collection and provides memory of pivot sums\r\n    '''\r\n\r\n    def __init__(self, *filehints):\r\n        self.filename = Reports.find_report(*filehints)\r\n        self.info = []\r\n        self.headers = []\r\n        self.data = []\r\n        self.pivot_sums = {}\r\n        self.load()\r\n\r\n    def load(self):\r\n        logging.debug('loading %s' %(self.filename))\r\n        fh = open(self.filename)\r\n        reader = csv.reader(fh)\r\n        self.info = reader.next()\r\n        self.headers = reader.next()\r\n        for row in reader:\r\n            self.data.append(row)\r\n        self.curval = 0 \r\n\r\n    def __getitem__(self, key):\r\n        return VirtualRecord(self.data[key], self)\r\n\r\n    def __iter__(self):\r\n        self.curval = 0 \r\n        return self\r\n\r\n    def next(self):\r\n        if self.curval < 0 or self.curval >= (len(self.data) - 1): \r\n            raise StopIteration\r\n        else:\r\n            self.curval += 1\r\n            return VirtualRecord(self.data[self.curval], self)\r\n\r\n    def __len__(self):\r\n        return len(self.data)\r\n\r\n    def sum(self, col):\r\n        if col not in self.pivot_sums:\r\n            logging.debug('computing sum for %s in %s' %(col, ', '.join(self.info)))\r\n            sum = Decimal(0)\r\n            index = col\r\n            if isinstance(index, str):\r\n                index = [x.lower() for x in self.headers].index(index.lower())\r\n            for i in self.data:\r\n                sum += Decimal(i[index]) if i[index] != '' else 0\r\n            self.pivot_sums[col] = sum\r\n        return self.pivot_sums[col]\r\n\r\n\r\nclass VirtualRecord():\r\n    '''\r\n    A virtual record within a Report object, only one instance exists at a time (not-thread safe) \r\n    to avoid the memory crunch from large report files\r\n    '''\r\n    _instance = None\r\n    def __new__(self, *args, **kwargs):\r\n        if not self._instance:\r\n            self._instance = super(VirtualRecord, self).__new__(self)\r\n        return self._instance\r\n\r\n    def __init__(self, row, parent):\r\n        self.row = row\r\n        self.parent = parent\r\n\r\n    def __getitem__(self, key):\r\n        index = key\r\n        if isinstance(key, str):\r\n            index = [x.lower() for x in self.parent.headers].index(key.lower())\r\n        return self.row[index]\r\n\r\n    def __len__(self):\r\n        return len(self.row)\r\n\r\n\r\nclass Reports:\r\n    '''\r\n    Magic collection of billing reports\r\n    '''\r\n\r\n    reports = {}\r\n\r\n    def __init__(self, reportdir = '.'):\r\n        self.reportdir = reportdir\r\n\r\n    def __getattr__(self, report):\r\n        '''\r\n        magically find reports within self.reportdir\r\n\r\n        For example, if there is a report 'sales-tax.2011-09.20111018.csv'\r\n        You can access this with an attribute 'sales_tax', e.g.,\r\n        >>> \r\n        >>> reports = Reports()\r\n        >>> len(reports.sales_tax)\r\n        4884\r\n        >>> \r\n        '''\r\n        filename = Reports.find_report(self.reportdir, report.replace('_', '?') + '.*.csv')\r\n        if filename is None or not os.stat(filename):\r\n            return None\r\n        if report not in self.reports:\r\n            self.reports[report] = Report(filename, table=report)\r\n        return self.reports[report]\r\n\r\n    @classmethod\r\n    def find_report(self, *hints):\r\n        files = glob.glob( os.path.join(*hints) )\r\n        if len(files) == 1:\r\n            return files[0]\r\n        elif len(files) == 0:\r\n            logging.error('No reports found: find_report(%s)' %(os.path.join(*hints)))\r\n        else:\r\n            logging.error('Found too many matching reports: find_report(%s) => %s' %(os.path.join(*hints), ', '.join(files) ))\r\n\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I would like to analyze a collection of CSV (comma-separated-values) files in python. Ideally, I would like to treat the csv data as a native python object. For example, >>> financial_detail = Report(&#8216;financial-detail.csv&#8217;) >>> transactions = {} >>> for row in financial_detail: &#8230; transactions.append(row[&#8216;Transaction&#8217;]) &#8230; >>> financial_detail.sum(&#8216;Tax Amount&#8217;) Decimal(&#8216;123456.10&#8217;) >>> Additionally, I would like to [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[16,6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/535"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=535"}],"version-history":[{"count":7,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/535\/revisions"}],"predecessor-version":[{"id":708,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/535\/revisions\/708"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=535"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=535"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=535"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}