Entry | UTF-1 |
Definition |
Name: UTF-1
Language(s): International
Status: Obscure, of mainly historical interest
Classification: Unicode Transformation Format, extended ASCII (not in the strictest sense of the term, as ASCII bytes can appear as trail bytes), variable-width encoding
Encodes: ISO 10646 (Unicode)
Extends: US-ASCII
Succeeded by: UTF-8

UTF-1 is one way of transforming ISO 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters as bytes of multi-byte sequences, making it unsuited for some uses (for instance, Unix filenames cannot contain the byte value used for the forward slash). UTF-1 is also slow to encode or decode because it divides and multiplies by a number that is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.

Design

UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point can be encoded in one, two, three, or five bytes. All code points from U+0000 to U+009F, which includes the ASCII range, are encoded as one byte. UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: the bytes 0x00–0x20 and 0x7F–0x9F always stand for the corresponding code point. This design, with 66 protected characters, tried to be compatible with ISO 2022.

UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, plus a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
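The moduli quoted above follow directly from counting the protected byte values. As a minimal sketch (the variable names are illustrative and not part of any standard), the arithmetic can be checked like this:

```python
# Byte values that each encoding never uses inside a multi-byte sequence,
# as listed in the text above.
utf1_protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
bocu1_protected = {0x00, 0x20} | set(range(0x07, 0x10)) | set(range(0x1A, 0x1C))

# The remaining byte values are the "digits" available for trail bytes.
utf1_base = 256 - len(utf1_protected)
bocu1_base = 256 - len(bocu1_protected)

# UTF-8 spends two of the eight bits of a trail byte on marking,
# leaving six payload bits.
utf8_base = 2 ** (8 - 2)

print(len(utf1_protected), utf1_base)    # 66 190
print(len(bocu1_protected), bocu1_base)  # 13 243
print(utf8_base)                         # 64
```

Only the UTF-8 base is a power of two, which is why UTF-8 can be encoded and decoded with shifts and masks while UTF-1 needs true division.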
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point (U+7FFFFFFF).
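To illustrate how the 31-bit range fits into at most five bytes, here is a hedged sketch of an encoder. The range boundaries (0x100, 0x4016, 0x38E2E), the lead-byte values, and the trail-byte mapping are not given in the text of this entry; they are recalled from the UTF-1 definition in the ISO 10646 annex and should be treated as assumptions. The names `utf1_encode` and `_trail` are hypothetical.

```python
BASE = 190  # 256 byte values minus the 66 protected ones


def _trail(digit: int) -> int:
    """Map a base-190 digit (0..189) to a byte outside the protected ranges."""
    if not 0 <= digit < BASE:
        raise ValueError("digit out of range")
    if digit < 0x5E:
        return digit + 0x21  # 0x21..0x7E (ASCII printing characters)
    return digit + 0x42      # 0xA0..0xFF


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF); boundaries are assumed, see above."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if cp < 0xA0:                      # one byte: the code point itself
        return bytes([cp])
    if cp < 0x100:                     # two bytes: fixed lead 0xA0
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes: one base-190 digit
        y = cp - 0x100
        return bytes([0xA1 + y // BASE, _trail(y % BASE)])
    if cp < 0x38E2E:                   # three bytes: two base-190 digits
        y = cp - 0x4016
        return bytes([0xF6 + y // BASE**2,
                      _trail(y // BASE % BASE),
                      _trail(y % BASE)])
    y = cp - 0x38E2E                   # five bytes: four base-190 digits
    return bytes([0xFC + y // BASE**4,
                  _trail(y // BASE**3 % BASE),
                  _trail(y // BASE**2 % BASE),
                  _trail(y // BASE % BASE),
                  _trail(y % BASE)])


print(utf1_encode(0x41).hex())        # 41
print(utf1_encode(0x10FFFF).hex())    # fc21396e6c
print(utf1_encode(0x7FFFFFFF).hex())  # fdbd2bb940
print(utf1_encode(0x010E))            # b'\xa1/'
```

The last line shows the slash problem: under these assumptions U+010E encodes to a sequence whose trail byte is 0x2F, the ASCII forward slash, so the encoded string could not be used as a Unix filename component.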