I would like to read J. Krishnamurti books on my Kindle. Unfortunately, no ebooks were available, although I did find that jkrishnamurti.org has an extensive collection of books on its website. At present there is no full download, only a per-chapter HTML viewer, and some of the books run over 80 chapters, which is far more than I am willing to copy and paste into a text file.
I decided I’d use Python and the HTMLParser library to write a throw-away parser. Since parsing multi-page websites into text files might be useful for other purposes, I wrote a simple closure that takes two parse functions and returns a custom scraper function that scrapes all pages of an article or book and saves them as a single plain text file.
One parse function must return the URL of the next page to scrape, and the other must return the HTML of the readable text. Basically, pass two parse functions (specific to whatever website you’re attempting to scrape) into the closure, and it will return a scraper built on those parse functions.
from posixpath import basename, dirname
from traceback import print_exc
import urllib, HTMLParser, re, sys

def make_scraper(findtext, findnext):
    def scraper(inurl, outfilename):
        base_url = dirname(inurl)
        next_url = basename(inurl)
        chapter = 1
        while True:
            html = urllib.urlopen(base_url + '/' + next_url).read()
            match = findtext(html)
            if match:
                print "chapter %s - %s" % (chapter, next_url)
                f = open(outfilename, 'a')
                f.write('\n\nCHAPTER %s\n\n%s' % (chapter, dehtml(match.group(1))))
                f.close()
                next_match = findnext(html)
                if next_match:
                    next_url = next_match.group(1)
                    chapter += 1
                else:
                    break
            else:
                break
    return scraper
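The closure pattern above can be reduced to a toy you can run offline. This is only a sketch of the design, not the scraper itself: the "site" here is a made-up dict of URL-to-HTML, and the two regexes stand in for real parse functions.

```python
import re

def make_scraper(findtext, findnext):
    """Return a scrape function specialized by the two parse callables."""
    def scrape(pages, start):
        # 'pages' stands in for the website: a dict mapping URL -> HTML.
        out, url = [], start
        while url is not None:
            html = pages[url]
            text = findtext(html)
            if text:
                out.append(text.group(1))
            nxt = findnext(html)
            url = nxt.group(1) if nxt else None
        return out
    return scrape

# Hypothetical two-page "site" with the same body / next-link structure.
pages = {
    'p1': '<body>one</body><next>p2</next>',
    'p2': '<body>two</body>',
}
scrape = make_scraper(re.compile('<body>(.*?)</body>').search,
                      re.compile('<next>(.*?)</next>').search)
print(scrape(pages, 'p1'))  # collects the body text of every page in order
```

The real scraper works the same way: it follows the "next" link that one parse function extracts, and accumulates the text that the other one extracts, until no next link is found.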
The dehtml function is a very simple HTMLParser subclass that strips out all the HTML tags while maintaining line and paragraph breaks.
class _DeHTMLParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = re.sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=sys.stderr)
        return text
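For anyone on Python 3, where HTMLParser moved to html.parser, the same stripper might look like the sketch below (a straight port, with the error handling dropped for brevity):

```python
import re
from html.parser import HTMLParser

class _DeHTMLParser(HTMLParser):
    """Collect text content, turning <p> and <br> into blank/line breaks."""
    def __init__(self):
        super().__init__()
        self._text = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Collapse any run of internal whitespace to a single space.
            self._text.append(re.sub(r'[ \t\r\n]+', ' ', text) + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._text.append('\n\n')
        elif tag == 'br':
            self._text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':  # XHTML-style <br/>
            self._text.append('\n\n')

    def text(self):
        return ''.join(self._text).strip()

def dehtml(text):
    parser = _DeHTMLParser()
    parser.feed(text)
    parser.close()
    return parser.text()

print(dehtml('<p>First para<br/>second line</p><p>Second para</p>'))
```

The tags not handled explicitly simply disappear, leaving only their text content.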
For example, I want to scrape and parse books by J. Krishnamurti, so I will use the following parse functions to create my custom scraper.
## custom parse functions
nextchapter = re.compile('<div id="chapter-forward"><a href=([^>]*)>',
                         re.M | re.S).search
parsetext = re.compile('<div id="chapter-forward">.*<div class="clear">' +
                       '(.*)<!-- box user preferences //-->\s*<div id="sidebar">',
                       re.M | re.S).search

## create custom parser/scraper
get_jkrish_text = make_scraper(parsetext, nextchapter)
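To see what the nextchapter parse function actually captures, here it is applied to a made-up snippet of the site's chapter-navigation markup (the href value mimics the chapter URLs shown below; the site's real markup may differ):

```python
import re

# Same pattern as the nextchapter parse function above.
nextchapter = re.compile('<div id="chapter-forward"><a href=([^>]*)>',
                         re.M | re.S).search

# Hypothetical fragment of the chapter-forward navigation div.
snippet = '<div id="chapter-forward"><a href=view-text.php?tid=29&chid=56861&w=>'

match = nextchapter(snippet)
if match:
    # group(1) is the relative URL of the next chapter, which the
    # scraper appends to the base URL on the next loop iteration.
    print(match.group(1))
```

Note the site's links are unquoted (href=... with no quotes), which is why the capture group is simply everything up to the closing angle bracket.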
I can now use get_jkrish_text(). Here it is downloading a shorter (9-chapter) book, Flame of Attention:
>>> get_jkrish_text(jkrish_url, 'flameofattention.txt')
chapter 1 - view-text.php?tid=29&chid=56860&w=
chapter 2 - view-text.php?tid=29&chid=56861&w=
chapter 3 - view-text.php?tid=29&chid=56862&w=
chapter 4 - view-text.php?tid=29&chid=56863&w=
chapter 5 - view-text.php?tid=29&chid=56864&w=
chapter 6 - view-text.php?tid=29&chid=56865&w=
chapter 7 - view-text.php?tid=29&chid=56866&w=
chapter 8 - view-text.php?tid=29&chid=56867&w=
chapter 9 - view-text.php?tid=29&chid=56868&w=
>>>