{"id":31,"date":"2011-05-26T05:58:13","date_gmt":"2011-05-26T05:58:13","guid":{"rendered":"http:\/\/www.anattasoftware.com\/blog\/?p=31"},"modified":"2012-12-25T22:40:48","modified_gmt":"2012-12-25T22:40:48","slug":"scraping-and-parsing-html","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/scraping-and-parsing-html\/","title":{"rendered":"scraping and parsing html"},"content":{"rendered":"<p>I would like to read J. Krishnamurti books on my Kindle. Unfortunately, no ebooks were available, although I did find that <a href=\"http:\/\/jkrishnamurti.org\">jkrishnamurti.org<\/a> has an extensive collection of books on their website.  At present there is no full download, only a per-chapter HTML viewer, and some of the books run over 80 chapters, which is more than I am going to copy+paste into a text file.<\/p>\n<p>I decided I&#8217;d use Python and the HTMLParser library to write a throw-away parser.  I realized that parsing multi-page websites into text files might be useful for other purposes, so I wrote a simple closure that takes two parse functions and returns a custom scraper function that will scrape all pages of an article or book and save them as a single plain text file.<\/p>\n<p>One parse function must return the URL of the next webpage to scrape, and the other must return the HTML of the readable text.  
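<\/p>\n<p>As a sketch of that contract (the markup below is hypothetical, not taken from any real site), a matching pair of parse functions for a site that wraps its readable text in a &lt;div class=\"body\"&gt; and links forward with &lt;a id=\"next\"&gt; might look like this:<\/p>\n<pre class=\"sh_python\">\nimport re\n\n## group(1) of the match is the readable HTML\nfindtext = re.compile('&lt;div class=\"body\"&gt;(.*?)&lt;\/div&gt;', re.M | re.S).search\n\n## group(1) of the match is the URL of the next page\nfindnext = re.compile('&lt;a id=\"next\" href=([^&gt;]*)&gt;', re.M | re.S).search\n<\/pre>\n<p>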
Basically, pass two parse functions (specific to whatever website you&#8217;re scraping) into the closure and it will return a scraper that uses them.<\/p>\n<pre class=\"sh_python\">\nfrom posixpath import basename, dirname\nfrom traceback import print_exc\nimport urllib, HTMLParser, re, sys\n\ndef make_scraper(findtext, findnext):\n\n    def scraper(inurl, outfilename):\n        base_url = dirname(inurl)\n        next_url = basename(inurl)\n        chapter = 1\n        while True:\n            html = urllib.urlopen(base_url + '\/' + next_url).read()\n            match = findtext(html)\n            if match:\n                print \"chapter %s - %s\" % (chapter, next_url)\n                f = open(outfilename, 'a')\n                f.write('\\n\\nCHAPTER %s\\n\\n%s' % (chapter, dehtml(match.group(1))))\n                f.close()\n                next_match = findnext(html)\n                if next_match:\n                    next_url = next_match.group(1)\n                    chapter += 1\n                else:\n                    break\n            else:\n                break\n\n    return scraper\n<\/pre>\n<p>The dehtml function is a very simple HTMLParser subclass that strips out all the HTML tags while preserving line and paragraph breaks.<\/p>\n<pre class=\"sh_python\">\nclass _DeHTMLParser(HTMLParser.HTMLParser):\n    def __init__(self):\n        HTMLParser.HTMLParser.__init__(self)\n        self.__text = []\n\n    def handle_data(self, data):\n        text = data.strip()\n        if len(text) > 0:\n            text = re.sub('[ \\t\\r\\n]+', ' ', text)\n            self.__text.append(text + ' ')\n\n    def handle_starttag(self, tag, attrs):\n        if tag == 'p':\n            self.__text.append('\\n\\n')\n        elif tag == 'br':\n            self.__text.append('\\n')\n\n    def handle_startendtag(self, tag, attrs):\n        if tag == 'br':\n            self.__text.append('\\n\\n')\n\n    def text(self):\n        return 
''.join(self.__text).strip()\n\ndef dehtml(text):\n    try:\n        parser = _DeHTMLParser()\n        parser.feed(text)\n        parser.close()\n        return parser.text()\n    except:\n        print_exc(file=sys.stderr)\n        return text\n<\/pre>\n<p>For example, I want to scrape and parse books from <a href=\"http:\/\/jkrishnamurti.org\">J. Krishnamurti<\/a>, so I will use the following parse functions to create my custom scraper.<\/p>\n<pre class=\"sh_python\">\n## custom parse functions\nnextchapter = re.compile('&lt;div id=\"chapter-forward\"&gt;&lt;a href=([^&gt;]*)&gt;',\n                          re.M | re.S).search\nparsetext = re.compile('&lt;div id=\"chapter-forward\"&gt;.*&lt;div class=\"clear\"&gt;' + \n                       '(.*)&lt;!-- box user preferences \/\/--&gt;\\s*&lt;div id=\"sidebar\"&gt;',\n                       re.M | re.S).search\n\n## create custom parser\/scraper\nget_jkrish_text = make_scraper(parsetext, nextchapter)\n<\/pre>\n<p>I can now use get_jkrish_text(). Here it is downloading a shorter (9-chapter) book, <a href=\"http:\/\/jkrishnamurti.org\/krishnamurti-teachings\/view-text.php?tid=29&#038;chid=56860&#038;w=\">Flame of Attention<\/a>.<\/p>\n<pre class=\"sh_python\">\n>>> jkrish_url = 'http:\/\/jkrishnamurti.org\/krishnamurti-teachings\/view-text.php?tid=29&chid=56860&w='\n>>> get_jkrish_text(jkrish_url, 'flameofattention.txt')\nchapter 1 - view-text.php?tid=29&chid=56860&w=\nchapter 2 - view-text.php?tid=29&chid=56861&w=\nchapter 3 - view-text.php?tid=29&chid=56862&w=\nchapter 4 - view-text.php?tid=29&chid=56863&w=\nchapter 5 - view-text.php?tid=29&chid=56864&w=\nchapter 6 - view-text.php?tid=29&chid=56865&w=\nchapter 7 - view-text.php?tid=29&chid=56866&w=\nchapter 8 - view-text.php?tid=29&chid=56867&w=\nchapter 9 - view-text.php?tid=29&chid=56868&w=\n>>> \n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I would like to read J. Krishnamurti books on my Kindle. Unfortunately, no ebooks were available although I did find that jkrishnamurti.org has an extensive collection of books on their website. 
At present there is no full download, only a per-chapter html viewer, and some of the books ran over 80 chapters, which is more [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[11,6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/31"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=31"}],"version-history":[{"count":2,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/31\/revisions"}],"predecessor-version":[{"id":741,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/31\/revisions\/741"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=31"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=31"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=31"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}