Previously, we discussed using Punycode for non-ASCII domain names with internationalized URLs, e.g., https://去.cc/叼. I would like to use this approach to create a URL shortening service, where we can create shortened URLs that use UTF-8 characters in addition to the normal ASCII characters. Most URL shortening services use a base-62 alphanumeric key to map […]
python
PseudoForm, a Bot-resistant Web Form
I would like to create an HTML form that is resistant to bots. This could be a classic comment form or really any web-accessible form. In most cases, requiring authentication and an authorization service (such as OAuth) would be sufficient to protect a back-end web service. But what if you have a publicly accessible web […]
Trie or Set
Given a grid or input stream of characters, I would like to discover all words according to a given dictionary. This could be a dictionary of all English words or phrases (say, for an autocomplete service), or for any language. This is especially useful for languages where words are not clearly separated (e.g., Japanese, Chinese, […]
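As a rough illustration of the trie side of this comparison, here is a minimal sketch that finds every dictionary word starting at any position in an unsegmented string. The names build_trie and find_words, the nested-dict layout, and the use of None as an end-of-word marker are my own assumptions for illustration, not necessarily what the full post uses.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word  # store the complete word at its terminal node
    return root


def find_words(text, trie):
    """Return every dictionary word found starting at each position in text."""
    found = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:  # no dictionary word continues with this character
                break
            if None in node:  # a complete word ends here
                found.append(node[None])
    return found


# Hypothetical dictionary and input, for illustration only.
trie = build_trie(["cat", "cats", "at", "sat"])
print(find_words("catsat", trie))
```

The trie lets the scan stop as soon as no word can match, whereas a plain set would need a membership test for every candidate substring.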
iter_words
I would like to iterate over a stream of words, say, from STDIN or a file (or any random input stream). Typically, this is done like this, And then one can simply, For a more concrete example, let’s say we need to keep a count of every unique word in an input stream, something like […]
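A minimal sketch of the word-counting example, assuming words are separated by whitespace. The body of iter_words and the helper count_words are my own guesses at the shape of the idea, not the code from the full post.

```python
import io
from collections import Counter


def iter_words(stream):
    """Yield whitespace-separated words from any iterable of lines."""
    for line in stream:
        for word in line.split():
            yield word


def count_words(stream):
    """Count every unique word in the stream (count_words is a hypothetical helper)."""
    return Counter(iter_words(stream))


# io.StringIO stands in for sys.stdin or an open file.
sample = io.StringIO("spam eggs spam\nfoo spam\n")
print(count_words(sample).most_common(1))
```

Because iter_words is a generator, the input is consumed one line at a time and never has to fit in memory.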
Punycode
I would like a webapp that supports UTF-8 URLs. For example, https://去.cc/叼, where both the path and the server name contain non-ASCII characters. The path /叼 can be handled easily with %-encodings, e.g., Note: this is similar to the raw byte representation of the unicode string: However, the domain name, “去.cc” cannot be usefully %-encoded […]
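For reference, both encodings can be sketched with the Python 3 standard library: urllib.parse.quote for the path and the built-in idna codec for the host name. The helper name ascii_url is hypothetical, and this is only one way to do it.

```python
from urllib.parse import quote


def ascii_url(host, path, scheme="https"):
    """Return an ASCII-only URL (ascii_url is a hypothetical helper name)."""
    ascii_host = host.encode("idna").decode("ascii")  # Punycode, label by label
    ascii_path = quote(path)                          # %-encode the UTF-8 bytes
    return "{}://{}{}".format(scheme, ascii_host, ascii_path)


print(quote("/叼"))                # /%E5%8F%BC
print("去.cc".encode("idna"))      # b'xn--1nr.cc'
print(ascii_url("去.cc", "/叼"))   # https://xn--1nr.cc/%E5%8F%BC
```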
Graph Search
I would like to discover paths between two nodes on a graph. Let’s say we have a graph that looks something like this: The graph object contains a collection of nodes and their corresponding connections. If it’s a bi-directional graph, those connections would have to appear in the corresponding sets (e.g., 1: set([2]) and 2: […]
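One way to sketch the search is a breadth-first traversal over a dict that maps each node to a set of neighbours. The sample graph and the function name find_path below are made up for illustration and are not the graph from the full post.

```python
from collections import deque


def find_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # sorted() only makes the visiting order deterministic
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical bi-directional graph: every edge appears in both sets.
graph = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3, 5}, 5: {4}}
print(find_path(graph, 1, 5))  # [1, 2, 4, 5]
```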
python unittest
I would like to set up unit tests for a python application. There are many ways to do this, including doctest and unittest, as well as 3rd-party frameworks that leverage python’s unittest, such as pytest and nose. I found the plain-old unittest framework to be the easiest to work with, although I often run into questions […]
locking and concurrency in python, part 2
Previously, I created a “MultiLock” class for managing locks and lockgroups across a shared file system. Now I want to create a simple command-line utility that uses this functionality. To start, we can create a simple runone() function that leverages MultiLock, e.g., def _runone(func, lockname, lockgroup, basedir, *args, **kwargs): ''' run one, AND ONLY ONE, […]
locking and concurrency in python, part 1
I would like to do file-locking concurrency control in python. Additionally, I would like to provide a “run-once-and-only-once” functionality on a shared cluster; in other words, I have multiple batch jobs to run over a shared compute cluster and I want a simple way to provide fault tolerance for parallel jobs. The batch jobs should […]
zip archive in python
I would like to create zip archives within a python batch script. I would like to compress individual files or entire directories of files. You can use the built-in zipfile module, and create a ZipFile as you would a normal File object, e.g., >>> import zipfile >>> foo = zipfile.ZipFile('foo.zip', mode='w') >>> foo.write('foo.txt') Unfortunately, by default […]
timeout command in python
I would like to add a timeout to any shell command such that if it does not complete within a specified number of seconds the command will exit. This would be useful for any long-running command where I’d like it to die on its own rather than manually killing the long-running process. There are […]
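One possible approach in Python 3 is the timeout parameter of subprocess.run, which kills the child process once the time is up and then raises TimeoutExpired. The wrapper name run_with_timeout is hypothetical, and this sketch takes an argument list rather than a shell string.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run cmd (an argument list); return its exit code, or None on timeout."""
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point
        return None


# A deliberately slow command, using the current interpreter so it runs anywhere.
slow = [sys.executable, "-c", "import time; time.sleep(10)"]
print(run_with_timeout(slow, 0.5))  # None
```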
python slice and sql every Nth row
I would like to retrieve every Nth row of a SQL table, and I would like this accessed via a python slice function. A python slice allows access to a list (or any object that implements a __getitem__ method) by a start, stop, and step — for example, >>> foo = range(100) >>> foo[5] 5 […]
python, finding recurring pairs of data
I would like to find all pairs of data that appear together at least 10 times. For example, given a large input file of keywords: >>> foo, bar, spam, eggs, durian, stinky tofu, … >>> fruit, meat, vinegar, sphere, foo, … >>> merlot, red, hearty, spam, oranges, durian, … >>> … “durian” and “spam” appear […]
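For inputs that fit in memory, a straightforward sketch counts every unordered pair per line with collections.Counter and itertools.combinations. The function name recurring_pairs and the comma-splitting rule are assumptions based on the sample input above, and this approach may not scale to the large file the problem describes.

```python
from collections import Counter
from itertools import combinations


def recurring_pairs(lines, threshold=10):
    """Return {(a, b): count} for keyword pairs seen together on at least
    `threshold` lines. Each pair is stored in sorted order."""
    counts = Counter()
    for line in lines:
        # de-duplicate within a line and sort so (a, b) and (b, a) are one pair
        keywords = sorted({k.strip() for k in line.split(",") if k.strip()})
        counts.update(combinations(keywords, 2))
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Hypothetical input: ten identical lines plus one unrelated line.
lines = ["foo, bar, spam, durian"] * 10 + ["spam, eggs"]
print(recurring_pairs(lines)[("durian", "spam")])  # 10
```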
python, analyzing csv files, part 2
Previously, we discussed analyzing CSV files, parsing the csv into a native python object that supports iteration while providing easy access to the data (such as a sum by column header). For very large files this can be cumbersome, especially where more advanced analytics are desired. I would like to keep the same simple interface […]
python, analyzing csv files, part 1
I would like to analyze a collection of CSV (comma-separated-values) files in python. Ideally, I would like to treat the csv data as a native python object. For example, >>> financial_detail = Report('financial-detail.csv') >>> transactions = [] >>> for row in financial_detail: … transactions.append(row['Transaction']) … >>> financial_detail.sum('Tax Amount') Decimal('123456.10') >>> Additionally, I would like to […]