I would like a webapp that supports UTF-8 URLs. For example, https://去.cc/叼, where both the path and the server name contain non-ASCII characters.
The path /叼 can be handled easily with %-encodings, e.g.,
>>> import urllib.parse
>>> urllib.parse.quote('/叼')
'/%E5%8F%BC'
Note: the escaped form is simply the UTF-8 byte sequence of the string, with each non-ASCII byte written as %XX:
>>> bytes('/叼', 'utf8')
b'/\xe5\x8f\xbc'
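To make the relationship between the two forms explicit, here is a minimal sketch that builds the %-encoding by hand from the UTF-8 bytes. The helper name percent_encode_utf8 is made up for illustration, and unlike urllib.parse.quote it leaves every ASCII character untouched, so it is not a replacement for the library call.

```python
from urllib.parse import quote


def percent_encode_utf8(text: str) -> str:
    """Hypothetical helper: escape each non-ASCII character as its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII passes through unchanged in this sketch (quote() is stricter).
            out.append(ch)
        else:
            # One %XX triplet per UTF-8 byte of the character.
            out.extend('%{:02X}'.format(b) for b in ch.encode('utf-8'))
    return ''.join(out)


manual = percent_encode_utf8('/叼')
library = quote('/叼')
print(manual, library)
```

For this path the hand-built string and the result of quote agree.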
However, the domain name “去.cc” cannot be usefully %-encoded, because “%” is not a valid character in a hostname. The standard encoding for internationalized domain names (IDN) is Punycode, under which “去.cc” becomes “xn--1nr.cc”.
The “xn--” prefix is the ASCII Compatible Encoding (ACE) prefix, which marks a label as Punycode-encoded. Most modern web browsers and HTTP libraries handle this conversion themselves, but if you need to do it by hand, the raw Punycode of a single label (without the prefix) can be computed like this:
>>> '去'.encode('punycode')
b'1nr'
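As a rough illustration of how the prefix and the raw Punycode fit together, the following sketch assembles an ACE hostname label by label. The names to_ace_label and to_ace_hostname are hypothetical, and the sketch skips the normalization and length checks that the built-in “idna” codec performs, so treat it as a demonstration rather than a substitute.

```python
def to_ace_label(label: str) -> str:
    """Hypothetical helper: prefix a non-ASCII label's Punycode with 'xn--'."""
    if label.isascii():
        # Plain ASCII labels such as 'cc' are left as they are.
        return label
    return 'xn--' + label.encode('punycode').decode('ascii')


def to_ace_hostname(hostname: str) -> str:
    """Encode each dot-separated label independently and rejoin them."""
    return '.'.join(to_ace_label(part) for part in hostname.split('.'))


ace = to_ace_hostname('去.cc')
print(ace)
```

For this hostname the result matches what the “idna” codec produces.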
In practice, we can use Python's built-in “idna” codec, which adds the prefix and handles each label of the hostname, to go from IRI to URI:
>>> p = urllib.parse.urlparse('https://去.cc/叼')
>>> p.netloc.encode('idna')
b'xn--1nr.cc'
>>> urllib.parse.quote(p.path)
'/%E5%8F%BC'
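The two steps can be wrapped into one function. This is a sketch under a few assumptions: the name iri_to_uri is invented here, the input is an absolute URL with a host, any userinfo is dropped, and the input is not already %-encoded (otherwise the “%” signs would be escaped a second time). It encodes the hostname rather than the whole netloc so that a port number, if present, is kept out of the “idna” step.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: IDNA-encode the host and %-encode the rest."""
    parts = urlsplit(iri)
    # Assumes parts.hostname is not None, i.e. the IRI is absolute.
    host = parts.hostname.encode('idna').decode('ascii')
    netloc = host if parts.port is None else '{}:{}'.format(host, parts.port)
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path),               # '/' is kept by default
        quote(parts.query, safe='=&'),   # keep key=value&... separators
        quote(parts.fragment),
    ))


simple = iri_to_uri('https://去.cc/叼')
with_port = iri_to_uri('https://去.cc:8443/叼?q=去')
print(simple)
print(with_port)
```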
And going the other direction, i.e., URI to IRI:
>>> a = urllib.parse.urlparse('https://xn--1nr.cc/%E5%8F%BC')
>>> a.netloc.encode('utf8').decode('idna')
'去.cc'
>>> urllib.parse.unquote(a.path)
'/叼'
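For the reverse direction, a similar sketch might look like the following. The name uri_to_iri is again hypothetical; the function assumes an absolute URL, drops any userinfo, and decodes only the hostname and path. The query and fragment are passed through unchanged, since unquoting them could turn an escaped “&” or “=” into a separator and change the meaning of the URL.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def uri_to_iri(uri: str) -> str:
    """Hypothetical helper: decode an ACE host and a %-encoded path for display."""
    parts = urlsplit(uri)
    # Assumes parts.hostname is not None, i.e. the URI is absolute.
    host = parts.hostname.encode('ascii').decode('idna')
    netloc = host if parts.port is None else '{}:{}'.format(host, parts.port)
    return urlunsplit((
        parts.scheme,
        netloc,
        unquote(parts.path),   # UTF-8 is the default decoding
        parts.query,           # left as-is on purpose
        parts.fragment,        # left as-is on purpose
    ))


readable = uri_to_iri('https://xn--1nr.cc/%E5%8F%BC')
readable_port = uri_to_iri('https://xn--1nr.cc:8443/%E5%8F%BC')
print(readable)
print(readable_port)
```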