I would like to create an HTML form that is resistant to bots. This could be a classic comment form or really any web-accessible form. In most cases, requiring authentication and an authorization service (such as OAuth) would be sufficient to protect a back-end web service. But what if you have a publicly accessible web form, and you want to keep it open and anonymous, but provide some measure of protection against automated bots?
For example, I would like to protect my URL Shortening Service such that only humans using the web front-end can access the back-end service layer, and I would like this layer of protection to be bot resistant.
Most spam-bots will crawl through webpages looking for HTML forms, and will seek to exploit any unprotected form handlers (i.e., the backend service). A more advanced bot can even leverage common authentication and authorization services. Other than annoying our users with CAPTCHA tests, what are some approaches that we can take?
Honeypot Form
A simple and effective approach is to create a dummy HTML form that is not rendered in the actual browser, but will be discovered by most spam-bots. This could be as simple as,
    <div id="__honey__">
      <form action="post-comments.php" method="POST">
        <input type="text" name="Name">
        ...
      </form>
    </div>
The “post-comments.php” will be discovered by most spam-bots, yet in our case this is a misdirect. If you don’t want it to return a 404, you can add a dummy file named “post-comments.php” that does nothing but will always return a 200. In fact, anyone trying to access this form handler is likely a bot, so feel free to do whatever you want with this. A simple but fun trick is to include the request IP address (the IP address of the bot itself) as the form action,
<form action="http://{{request.remote_addr}}/post.php" method="POST">
A naive spam-bot may start scanning its own IP address and may even send HTTP POST messages to itself.
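On the server side, the two tricks above (a decoy handler that always answers 200, and a form action pointing back at the requester) could look roughly like the sketch below. The function names and the in-memory `suspected_bots` set are hypothetical, and nothing here depends on a particular web framework; the caller is assumed to pass in the remote address it got from the request.

```python
import ipaddress

# Hypothetical in-memory record of addresses that posted to the decoy handler.
suspected_bots = set()


def honeypot_action(remote_addr):
    """Build a decoy form action that points back at the requester."""
    # ip_address() raises ValueError for anything that is not a valid IP,
    # so arbitrary strings never end up inside the generated markup.
    addr = ipaddress.ip_address(remote_addr)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{addr}]" if addr.version == 6 else str(addr)
    return f"http://{host}/post.php"


def handle_decoy_post(remote_addr):
    """Decoy handler: remember the caller, then pretend everything went fine."""
    suspected_bots.add(remote_addr)
    return 200, "OK"
```

What to do with the recorded addresses (ignore them, rate-limit them, block them) is a policy choice and is left open here.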
Meanwhile, we would add Javascript in <head> with a “defer” attribute,
<script src="PseudoForm.js" defer></script>
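The “defer” attribute causes “PseudoForm.js” to execute once the user-agent has finished parsing the document, just before the DOMContentLoaded event fires. Because a browser may start painting before parsing is complete, the honeypot should also be hidden with CSS if it must never flash on screen. Inside this script, you can replace the fake honeypot form.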
document.getElementById('__honey__').innerHTML = actual_form;
The “actual_form” can be anything you want, and to further confound spam-bots, it does not even need to be an actual HTML form.
PseudoForm
In the case of my URL Shortening Service, the actual form does not use an HTML <form> element, and instead uses a simple <button>. In fact, using proper CSS, you can use any element you want, and style it as a button, even something like this,
    <div id="__honey__">
      <input type="text" id="add_url" placeholder="Enter long URL...">
      <i>+</i>
    </div>
There is no form handler in this case, and the back-end service is known only to the Javascript (which can add onclick event handlers to our CSS-styled button). This is sufficient to protect against naive spam-bots, but a more advanced spam-bot can easily parse our Javascript and determine the actual service endpoints, for example by looking for calls to XMLHttpRequest.
PseudoForm + formkey
In order to ensure the back-end service is accessed via our front-end, you can create a time-sensitive token. For my URL Shortening Service, all back-end service calls require a formkey. That is,
    // fetch a fresh formkey and store it on the PseudoForm object (_self)
    this.getformkey = function() {
        getJSON('/api/formkey', function(err, data) {
            if (err !== null) {
                console.log('Something went wrong: ' + err);
            } else {
                if ("formkey" in data) {
                    _self.formkey = data.formkey;
                }
            }
        });
    };
“getJSON” is a wrapper around XMLHttpRequest, and all this does is fetch a valid formkey and save it within the PseudoForm Javascript object. We can then keep formkey up to date in the webapp, like so,
setInterval( PseudoForm.getformkey, 60*1000 );
In the above case, PseudoForm.formkey will be refreshed every 60 seconds, and any calls to the back-end service must use this formkey. The value of formkey will appear as a nonsensical hex string, yet in reality it is a time-sensitive hash that can only be verified by the back-end (which generated the formkeys in the first place), i.e.,
    import hashlib
    import time

    # methods of the back-end service class
    # (self.secret and self._unicode are defined elsewhere)

    def getFormKey(self, ts=None, divsec=300):
        ''' return a time-sensitive hash that can be calculated
            at a later date to determine if it matches

            the hash will change every divsec seconds (on the clock)
            e.g., divsec=300, every 5-minutes the hash will change
        '''
        if ts is None:
            ts = int(time.time())
        # integer division, so the hashed value only changes once per window
        return hashlib.sha1((
            self._unicode(ts // divsec) + self.secret).encode('utf-8')
        ).hexdigest()[2:]

    def isValidFormKey(self, testkey, lookback=2, divsec=300):
        ''' return True IFF the testkey matches the hash for one of the
            {lookback} most recent iterations (including the current one)

            e.g., divsec=300, a formkey will be valid for at least 5 minutes (no more than 10)
            e.g., lookback=6, a formkey will be valid for at least 25 minutes (no more than 30)
        '''
        for i in range(lookback):
            # pass divsec through so both methods use the same window size
            if testkey == self.getFormKey(int(time.time()) - divsec * i, divsec):
                return True
        return False
I included a secret key (by default a server generated GUID), to ensure that a bot would never be able to calculate a valid formkey on its own. All back-end services require a valid formkey, which means that any automated bot that accesses one of our back-end services must first fetch a valid formkey, and can only use that formkey within the time limit set by the back-end itself (with the defaults, a formkey stays valid for between 5 and 10 minutes).
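To make the validity window concrete, here is a standalone sketch of the same idea with the clock passed in explicitly, so the arithmetic can be checked with fixed timestamps. The function names, the `SECRET` placeholder, and the use of `hmac.compare_digest` for the comparison are assumptions made for this illustration; this is not the x404 code.

```python
import hashlib
import hmac
import time

SECRET = "replace-with-a-server-generated-secret"  # hypothetical placeholder


def form_key(secret, ts=None, divsec=300):
    """Hash the index of the current divsec-second window together with the secret."""
    if ts is None:
        ts = int(time.time())
    window = ts // divsec  # integer window index; changes once every divsec seconds
    return hashlib.sha1((str(window) + secret).encode("utf-8")).hexdigest()[2:]


def is_valid_form_key(secret, testkey, now=None, lookback=2, divsec=300):
    """Accept keys from the current window and the lookback - 1 windows before it."""
    if now is None:
        now = int(time.time())
    for i in range(lookback):
        expected = form_key(secret, now - divsec * i, divsec)
        # constant-time comparison on bytes (an assumption, not in the original)
        if hmac.compare_digest(testkey.encode("utf-8"), expected.encode("utf-8")):
            return True
    return False


# A key issued at t=1000 belongs to window 3 (seconds 900 through 1199).
key = form_key(SECRET, ts=1000)
print(is_valid_form_key(SECRET, key, now=1499))  # window 4, looks back to 3: True
print(is_valid_form_key(SECRET, key, now=1500))  # window 5, looks back to 4: False
```

With `divsec=300` and `lookback=2`, a key fetched at the very start of a window stays valid for just under 10 minutes, and one fetched at the very end of a window for just over 5.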
PseudoForm + handshake
Theoretically, an automated bot could be programmed to utilize a valid formkey, giving the appearance of a real human user accessing the front-end, and it could then spam the back-end service. This was exactly what I wanted to avoid in my URL Shortening Service. To further strengthen the PseudoForm, all back-end services require a time-sensitive handshake.
In other words, when you make a request to the URL Shortening Service, the response is a unique time-sensitive handshake. The client must return this handshake within a narrow time-window. If the handshake comes back too soon, the request is denied. If it comes back too late, the request is also denied. This has the added benefit of throttling the number of real transactions (roughly one per second) per IP address.
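One way such a handshake could be built is as a signed token that carries its own issue time, so the back-end can check both authenticity and timing when it comes back. The sketch below is an assumption about how this might work, not the actual x404 implementation; `SECRET`, `MIN_WAIT`, `MAX_WAIT`, and both function names are hypothetical.

```python
import hashlib
import hmac
import secrets
import time

SECRET = b"replace-with-a-server-side-secret"  # hypothetical placeholder
MIN_WAIT = 1.0  # seconds; returning the handshake sooner is rejected (assumed value)
MAX_WAIT = 5.0  # seconds; returning the handshake later is rejected (assumed value)


def issue_handshake(remote_addr, now=None):
    """Create a token of the form nonce:issued_at:signature, bound to the caller's address."""
    now = time.time() if now is None else now
    nonce = secrets.token_hex(8)
    issued = f"{now:.3f}"
    payload = f"{nonce}:{issued}:{remote_addr}".encode("utf-8")
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{nonce}:{issued}:{sig}"


def check_handshake(token, remote_addr, now=None):
    """Return True only for an untampered token returned inside the time window."""
    now = time.time() if now is None else now
    try:
        nonce, issued, sig = token.split(":")
        issued_at = float(issued)
    except ValueError:
        return False  # malformed token
    payload = f"{nonce}:{issued}:{remote_addr}".encode("utf-8")
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return False  # forged, altered, or issued to a different address
    return MIN_WAIT <= now - issued_at <= MAX_WAIT
```

Because this version is stateless, the same token could be replayed within its window; a real service would presumably also remember which handshakes have already been used.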
Therefore, in order for an automated bot to exploit the PseudoForm, it not only needs to bypass the honeypot, it needs to carefully time its HTTP calls to the back-end, throttling itself, and perfectly mimic what would happen seamlessly by a human on the front-end.
It’s certainly not impossible, and any software engineer of reasonable talent could build such a bot, but that bot would effectively be using the front-end. In the end, PseudoForm makes it very difficult for bots, throttling any bot to human speeds, but keeping everything very simple for a human (try it yourself, there are no CAPTCHAs or any nonsense).
PseudoForm example
My URL Shortening Service, x404, leverages exactly this kind of PseudoForm (including some of the code in this page). It’s a simple URL shortening service that was created as a novelty, but it provides a protected back-end service that is far more difficult to exploit than just manually using the front-end. In other words, it’s extremely easy for a human to use, and incredibly difficult for a bot.
The full process of the x404 PseudoForm is as follows,
1. create honeypot HTML form (misdirect naive spam-bots)
2. create a PseudoForm that appears as a normal web form in the browser (also prevent the honeypot form from rendering)
3. user enters information and clicks "add"
   The PseudoForm front-end will display a "processing" message to the user. The PseudoForm front-end will also do an HTTP POST to the PseudoForm back-end, and if it includes a valid formkey, then the PseudoForm back-end will respond with a unique handshake string.
4. wait a second...
   After a short wait (roughly 1s, but configurable however you want), the PseudoForm front-end will do an HTTP PUT to the PseudoForm back-end. If the request carries a valid formkey and a valid return handshake, and arrives within the valid time window, then the back-end will commit the transaction and return success. Literally, the back-end will reject any return handshake (valid or not) that appears too soon (or too late).
5. success! user sees the results
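Steps 3 to 5 can be walked through with a small in-memory stand-in for the back-end. Everything in this sketch is an assumption made for illustration: the class name, the status codes, the wait limits, and the choice to discard a handshake after its first use (even a rejected one) are not taken from x404. The clock is passed in as `now` so the timing can be followed without actually sleeping.

```python
import secrets


class PseudoFormBackend:
    """Toy back-end: POST hands out a handshake, PUT commits if it returns on time."""

    def __init__(self, is_valid_formkey, min_wait=1.0, max_wait=5.0):
        self.is_valid_formkey = is_valid_formkey  # callable: formkey -> bool
        self.min_wait = min_wait
        self.max_wait = max_wait
        self.pending = {}    # handshake -> (submitted url, time issued)
        self.committed = []  # urls whose transaction completed

    def post(self, formkey, url, now):
        """Step 3: accept a submission and answer with a unique handshake."""
        if not self.is_valid_formkey(formkey):
            return 403, None
        handshake = secrets.token_hex(16)
        self.pending[handshake] = (url, now)
        return 202, handshake

    def put(self, formkey, handshake, now):
        """Step 4: commit only if the handshake comes back inside the window."""
        if not self.is_valid_formkey(formkey):
            return 403, None
        entry = self.pending.pop(handshake, None)  # single use, pass or fail
        if entry is None:
            return 403, None  # unknown or already used handshake
        url, issued = entry
        if not (self.min_wait <= now - issued <= self.max_wait):
            return 403, None  # too soon or too late
        self.committed.append(url)
        return 200, url


backend = PseudoFormBackend(lambda key: key == "good-key")

# An impatient bot returns the handshake immediately and is rejected.
status, handshake = backend.post("good-key", "https://example.com/long", now=10.0)
print(backend.put("good-key", handshake, now=10.1))

# A front-end that waits roughly a second gets through (step 5).
status, handshake = backend.post("good-key", "https://example.com/long", now=20.0)
print(backend.put("good-key", handshake, now=21.2))
```

In a real deployment the formkey check would be the time-sensitive hash described earlier, and `now` would come from the server clock rather than from the caller.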