{"id":1098,"date":"2019-09-13T11:15:24","date_gmt":"2019-09-13T11:15:24","guid":{"rendered":"http:\/\/tech.avant.net\/q\/?p=1098"},"modified":"2019-09-13T13:18:34","modified_gmt":"2019-09-13T13:18:34","slug":"extremely-large-numeric-bases-with-unicode","status":"publish","type":"post","link":"https:\/\/tech.avant.net\/q\/extremely-large-numeric-bases-with-unicode\/","title":{"rendered":"Extremely Large Numeric Bases with Unicode"},"content":{"rendered":"\n<p>Previously, we discussed using <a href=\"\/q\/punycode\/\">Punycode<\/a> for non-ASCII domain names with internationalized URLs, e.g., https:\/\/\u53bb.cc\/\u53fc<\/p>\n\n\n\n<p>I would like to use this approach to create a <a href=\"https:\/\/github.com\/timwarnock\/x404-Novelty-URL-Shortener\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\">URL Shortening Service<\/a>, where we can create shortened URLs that use UTF-8 characters in addition to the normal ASCII characters.<\/p>\n\n\n\n<p>Most URL shortening services use a base-62 alphanumeric key to map to a long URL. Typically, the base-62 characters include 26 uppercase letters (ABCD\u2026), 26 lowercase letters (abcd\u2026), and 10 digits (0123\u2026), for a total of 62 characters. Occasionally they will include an underscore and a dash, bringing you to base-64. 
This is all perfectly reasonable when using ASCII and trying to avoid non-printable characters.<\/p>\n\n\n\n<p>For example, using base-62 encoding, you may have a typical short URL that looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">http:\/\/shorturl\/xp5zR2<\/pre>\n\n\n\n<p>However, now that modern browsers support UTF-8 and offer basic rendering of popular Unicode character sets (\u4e2d\u6587, \ud55c\uae00, etc.), we can leverage a global set of symbols!<\/p>\n\n\n\n<p>Two of the larger contiguous Unicode ranges with decent support in modern browsers are the initial <a rel=\"noreferrer noopener\" aria-label=\"CJK block (opens in a new tab)\" href=\"https:\/\/en.wikipedia.org\/wiki\/CJK_Unified_Ideographs\" target=\"_blank\">CJK block<\/a> and the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hangul_Syllables\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"Korean Hangul syllables (opens in a new tab)\">Korean Hangul syllables<\/a>. Why are these interesting? Well, rather than base-62 or base-64, we can use CJK and Hangul syllables to create extremely large numeric bases.<\/p>\n\n\n\n<p>The CJK range of <em>4e00<\/em> to <em>9fea<\/em> seems to be adequately supported, as does the Hangul syllable range of <em>ac00<\/em> to <em>d7a3<\/em>. Counting both endpoints, these ranges hold 20,971 and 11,172 characters, which would give us a base-20971 and a base-11172 respectively. Rather than base-62, we can offer numeric bases into the tens of thousands!<\/p>\n\n\n\n<p>This would allow shortened URLs to look like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">http:\/\/\u53bb.cc\/\u53fc\nhttp:\/\/\u53bb.cc\/\ub1fc<\/pre>\n\n\n\n<p>Taken to extremes, let&#8217;s consider a really large number, like 9,223,372,036,854,775,807 (nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six billion eight hundred fifty-four million seven hundred seventy-five thousand eight hundred seven). This is the largest value a signed 64-bit integer can hold. 
Let&#8217;s see what happens when we encode this number in extremely large bases:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">9223372036854775807\n= 7M85y0N8lZa\n= \u7459\u7a03\u7470\u8723\u4e2f\n= \uc141\ubefe\ubb8b\ub7e3\uae50<\/pre>\n\n\n\n<p>The CJK and Hangul encodings are 6 characters shorter than their base-62 counterpart. For a URL Shortening Service, I&#8217;m not sure this will ever be useful. I&#8217;m not sure anyone will ever need to map nine quintillion URLs. There aren&#8217;t that many URLs, but there are billions of URLs. Let&#8217;s say we&#8217;re dealing with 88 billion URLs. In that case, let&#8217;s look at a more reasonable large number:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">88555222111\n= 3CG2Fy1\n= \u57f7\u6d2a\u4ec9\n= \ub2c1\uc75b\uaec5<\/pre>\n\n\n\n<p>NOTE: while the character length of the Chinese string is less than that of the base-62 string, each of the Chinese characters occupies 3 bytes in UTF-8, so the 3-character key above takes 9 bytes versus 7 bytes for the base-62 key. This will <strong>not<\/strong> save you bandwidth, although technically neither does ASCII, but it&#8217;s worth mentioning nonetheless.<\/p>\n\n\n\n<p>To convert a number to one of these Unicode ranges, you can use the following Python:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># Python 2 code: on Python 3, use chr() for unichr() and str() for unicode()\ndef encode_urange(n, ustart, uend):\n    # uend is exclusive, so the base is uend-ustart\n    base = uend-ustart\n    chars = []\n    # note: n == 0 yields an empty string\n    while n > 0:\n        n, d = divmod(n, base)\n        # digits are emitted least-significant first\n        chars.append(unichr(ustart+d))\n    return ''.join(chars)\n\ndef decode_urange(nstr, ustart, uend):\n    base = uend-ustart\n    basem = 1\n    n = 0\n    for c in unicode(nstr):\n        # reject characters outside [ustart, uend)\n        if not (ustart &lt;= ord(c) &lt; uend):\n            raise ValueError(\"{!r}, {!r} out of bounds\".format(nstr, c))\n        n += (ord(c)-ustart) * basem\n        basem = basem*base\n    return n<\/pre>\n\n\n\n<p>The CJK range is <em>4e00<\/em> to 
<em>9fea<\/em>, and you can map arbitrary CJK to base-10 as follows:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"false\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">>>> \n>>> decode_urange(u'\u4f60\u597d\u4e16\u754c', int('4e00',16), int('9fea',16))\n92766958466352922\n>>>\n>>> print encode_urange(92766958466352922, int('4e00',16), int('9fea',16))\n\u4f60\u597d\u4e16\u754c\n<\/pre>\n\n\n\n<p>Unicode is full of fun and interesting character sets. Here are some examples that I have built into <a href=\"https:\/\/github.com\/timwarnock\/x404-Novelty-URL-Shortener\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\"x404 (opens in a new tab)\">x404<\/a>:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># base-10   922,111\nbase62:     LSR3\ntop16:      eewwt\nCJK:        \u9db1\u4e2b\nhangul:     \uc3c9\uac52\nbraille:    \u28da\u2874\u2859\nanglosaxon: \u16e1\u16c7\u16de\u16bb\u16a2\ngreek:      \u03bf\u0392\u03a6\u03b4\nyijing:     \u4deb\u4dd4\u4deb\u4dc3<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Previously, we discussed using Punycode for non-ASCII domain names with internationalized URLs, e.g., https:\/\/\u53bb.cc\/\u53fc I would like to use this approach to create a URL Shortening Service, where we can create shortened URLs that use UTF-8 characters in addition to the normal ASCII characters. 
Most URL shortening services use a base-62 alphanumeric key to map [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6],"tags":[],"_links":{"self":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/1098"}],"collection":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/comments?post=1098"}],"version-history":[{"count":10,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/1098\/revisions"}],"predecessor-version":[{"id":1127,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/posts\/1098\/revisions\/1127"}],"wp:attachment":[{"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/media?parent=1098"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/categories?post=1098"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tech.avant.net\/q\/wp-json\/wp\/v2\/tags?post=1098"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}