{"id":972,"date":"2019-05-22T11:32:42","date_gmt":"2019-05-22T11:32:42","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=972"},"modified":"2019-05-22T11:40:04","modified_gmt":"2019-05-22T11:40:04","slug":"punycode","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/punycode\/","title":{"rendered":"Punycode"},"content":{"rendered":"\n<p>I would like a webapp that supports UTF-8 URLs. For example, <span style=\"font-size: 120%;\">https:\/\/\u53bb.cc\/\u53fc<\/span>, where both the path and the server name contain non-ASCII characters.<\/p>\n\n\n\n<p>The path <span style=\"font-size: 120%;\">\/\u53fc<\/span> can be handled easily with %-encodings, e.g.,<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> import urllib.parse\n>>> \n>>> urllib.parse.quote('\/\u53fc')\n'\/%E5%8F%BC'<\/pre>\n\n\n\n<p>Note: the %-encoding is simply the UTF-8 byte sequence of the string, with each non-ASCII byte written as %XX:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> bytes('\/\u53fc', 'utf8')\nb'\/\\xe5\\x8f\\xbc'<\/pre>\n\n\n\n<p>However, the domain name, &#8220;\u53bb.cc&#8221;, cannot be usefully %-encoded (that is, &#8220;%&#8221; is not a valid character in a hostname). The standard encoding for internationalized domain names (IDN) is Punycode, so that &#8220;\u53bb.cc&#8221; becomes &#8220;xn--1nr.cc&#8221;.<\/p>\n\n\n\n<p>The &#8220;xn--&#8221; prefix is the ASCII Compatible Encoding (ACE) prefix, which marks a hostname label as Punycode-encoded. 
Most modern web browsers and HTTP libraries handle these names transparently, but if needed you can encode a label yourself; note that the raw &#8220;punycode&#8221; codec does not add the prefix:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> \n>>> '\u53bb'.encode('punycode')\nb'1nr'<\/pre>\n\n\n\n<p>In practice, we can use Python&#8217;s built-in &#8220;idna&#8221; codec, which adds the prefix for us, to convert an IRI to a URI:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> p = urllib.parse.urlparse('https:\/\/\u53bb.cc\/\u53fc')\n>>> p.netloc.encode('idna')\nb'xn--1nr.cc'\n>>> urllib.parse.quote(p.path)\n'\/%E5%8F%BC'\n<\/pre>\n\n\n\n<p>And going the other direction, i.e., URI to IRI:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> a = urllib.parse.urlparse('https:\/\/xn--1nr.cc\/%E5%8F%BC')\n>>> a.netloc.encode('utf8').decode('idna')\n'\u53bb.cc'\n>>> urllib.parse.unquote(a.path)\n'\/\u53fc'<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I would like a webapp that supports UTF-8 URLs. For example, https:\/\/\u53bb.cc\/\u53fc, where both the path and the server name contain non-ASCII characters. 
The path \/\u53fc can be handled easily with %-encodings, e.g., Note: this is similar to the raw byte representation of the unicode string: However, the domain name, &#8220;\u53bb.cc&#8221; cannot be usefully %-encoded [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/972"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=972"}],"version-history":[{"count":5,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/972\/revisions"}],"predecessor-version":[{"id":977,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/972\/revisions\/977"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=972"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=972"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=972"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}