UTF-1

{{Infobox character encoding
| name = UTF-1
| mime =
| alias =
| image =
| caption =
| standard =
| lang = International
| status = Obscure, of mainly historical interest.
| classification = Unicode Transformation Format, extended ASCII,{{efn|Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.}} variable-width encoding
| encodes = ISO 10646 (Unicode)
| extends = US-ASCII
| prev =
| next = UTF-8
| extra =

}}

UTF-1 is one way of transforming ISO 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes substring searching and error recovery difficult. It reuses the ASCII printing characters as bytes of multi-byte sequences, which makes it unsuitable for some uses (for instance, Unix filenames cannot contain the byte 0x2F, the forward slash, except as a path separator). UTF-1 is also slow to encode and decode because it requires division and multiplication by a number that is not a power of 2. Because of these issues, it did not gain acceptance and was quickly replaced by UTF-8.
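The filename problem can be checked against the comparison table in this article. The following is a minimal Python sketch that uses the UTF-1 and UTF-8 byte sequences listed there for U+D7FF; the variable names are illustrative only.

```python
# Byte sequences for U+D7FF, copied from the comparison table in this article.
utf1_bytes = bytes.fromhex("F72FC3")
utf8_bytes = bytes.fromhex("ED9FBF")

# 0x2F is the ASCII forward slash, which Unix treats as a path separator.
SLASH = b"/"

utf1_has_slash = SLASH in utf1_bytes
utf8_has_slash = SLASH in utf8_bytes

print(utf1_has_slash)  # True: the UTF-1 trail byte 0x2F looks like '/'
print(utf8_has_slash)  # False: UTF-8 never uses ASCII bytes in multi-byte sequences

# Python's built-in UTF-8 codec agrees with the table entry.
assert "\ud7ff".encode("utf-8") == utf8_bytes
```

Python has no built-in UTF-1 codec, so the UTF-1 bytes here are taken from the table rather than computed.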

Design

UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point is encoded in one, two, three, or five bytes. All code points from U+0000 to U+009F, which includes the whole ASCII range, are encoded as a single byte.

UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: the bytes 0x00–0x20 and 0x7F–0x9F always stand for the corresponding code point. This design, with 66 protected characters, was intended to be compatible with ISO 2022.
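As a rough illustration of the protected set, the Python sketch below builds it from the two byte ranges given above and checks a few multi-byte UTF-1 sequences copied from the comparison table in this article. The names `PROTECTED` and `uses_protected_byte` are hypothetical helpers, not part of any library.

```python
# Bytes that always stand for themselves in UTF-1, per the ranges above:
# 0x00-0x20 (C0 controls and space) and 0x7F-0x9F (DEL and C1 controls).
PROTECTED = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))


def uses_protected_byte(seq: bytes) -> bool:
    """Return True if any byte of the sequence is a protected byte."""
    return any(b in PROTECTED for b in seq)


# Multi-byte UTF-1 sequences from the table (U+00A0, U+015D, U+D7FF, U+10FFFF).
samples = ["A0A0", "A17E", "F72FC3", "FC21396E6C"]
results = {s: uses_protected_byte(bytes.fromhex(s)) for s in samples}

for hex_seq, hit in results.items():
    print(hex_seq, hit)  # every sample prints False
```

Note that 0x2F is not protected, which is why it can appear inside the sequence for U+D7FF.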

UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2⁶ = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
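The three moduli follow from simple counting, which the short Python sketch below reproduces. It only restates the arithmetic in the paragraph above; the variable names are illustrative.

```python
# UTF-1: 33 bytes in 0x00-0x20 plus 33 bytes in 0x7F-0x9F are protected.
utf1_protected = len(range(0x00, 0x21)) + len(range(0x7F, 0xA0))
utf1_modulus = 256 - utf1_protected

# UTF-8: two of the eight bits in a continuation byte are reserved,
# leaving six payload bits.
utf8_modulus = 2 ** (8 - 2)

# BOCU-1: 0x00, 0x07-0x0F, 0x1A-0x1B and 0x20 are protected.
bocu1_protected = 1 + len(range(0x07, 0x10)) + len(range(0x1A, 0x1C)) + 1
bocu1_modulus = 256 - bocu1_protected

print(utf1_protected, utf1_modulus)    # 66 190
print(utf8_modulus)                    # 64
print(bocu1_protected, bocu1_modulus)  # 13 243
```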

code point	UTF-8	UTF-1
U+007F	7F	7F
U+0080	C2 80	80
U+009F	C2 9F	9F
U+00A0	C2 A0	A0 A0
U+00BF	C2 BF	A0 BF
U+00C0	C3 80	A0 C0
U+00FF	C3 BF	A0 FF
U+0100	C4 80	A1 21
U+015D	C5 9D	A1 7E
U+015E	C5 9E	A1 A0
U+01BD	C6 BD	A1 FF
U+01BE	C6 BE	A2 21
U+07FF	DF BF	AA 72
U+0800	E0 A0 80	AA 73
U+0FFF	E0 BF BF	B5 48
U+1000	E1 80 80	B5 49
U+4015	E4 80 95	F5 FF
U+4016	E4 80 96	F6 21 21
U+D7FF	ED 9F BF	F7 2F C3
U+E000	EE 80 80	F7 3A 79
U+F8FF	EF A3 BF	F7 5C 3C
U+FDD0	EF B7 90	F7 62 BA
U+FDEF	EF B7 AF	F7 62 D9
U+FEFF	EF BB BF	F7 64 4C
U+FFFD	EF BF BD	F7 65 AD
U+FFFE	EF BF BE	F7 65 AE
U+FFFF	EF BF BF	F7 65 AF
U+10000	F0 90 80 80	F7 65 B0
U+38E2D	F0 B8 B8 AD	FB FF FF
U+38E2E	F0 B8 B8 AE	FC 21 21 21 21
U+FFFFF	F3 BF BF BF	FC 21 37 B2 7A
U+100000	F4 80 80 80	FC 21 37 B2 7B
U+10FFFF	F4 8F BF BF	FC 21 39 6E 6C
U+7FFFFFFF	FD BF BF BF BF BF	FD BD 2B B9 40

Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
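The table rows can be reproduced with a small encoder. The following Python sketch is an illustration only: the function names `utf1_encode` and `_trail` are hypothetical, and the range boundaries, lead bytes, and digit-to-byte mapping are inferred from the table above and the "modulo 190" description rather than quoted from the ISO IR 178 text.

```python
def _trail(digit: int) -> int:
    """Map a base-190 digit (0..189) to an unprotected byte.

    Digits 0..93 map to 0x21..0x7E and digits 94..189 map to 0xA0..0xFF,
    so the 66 protected bytes are never produced.
    """
    return digit + 0x21 if digit < 0x5E else digit + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as a UTF-1 byte sequence."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if cp < 0xA0:                      # one byte, value unchanged
        return bytes([cp])
    if cp < 0x100:                     # two bytes with fixed lead byte 0xA0
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes, lead 0xA1..0xF5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _trail(y % 190)])
    if cp < 0x38E2E:                   # three bytes, lead 0xF6..0xFB
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      _trail(y // 190 % 190),
                      _trail(y % 190)])
    y = cp - 0x38E2E                   # five bytes, lead 0xFC..0xFD
    return bytes([0xFC + y // 190**4,
                  _trail(y // 190**3 % 190),
                  _trail(y // 190**2 % 190),
                  _trail(y // 190 % 190),
                  _trail(y % 190)])


print(utf1_encode(0x10FFFF).hex())  # fc21396e6c, matching the table row
```

The repeated integer division and remainder by 190 in this sketch is the non-power-of-2 arithmetic that makes UTF-1 comparatively slow; UTF-8 needs only shifts and masks.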

References

{{cite web |title=ISO IR 178: UCS Transformation Format One (UTF-1) |author=ISO/IEC JTC 1/SC2/WG2 |author-link=ISO/IEC JTC 1/SC2/WG2 |date=1993-01-21 |edition=1 |id=Registration number 178 |url=http://kikaku.itscj.ipsj.or.jp/ISO-IR/178.pdf |type=PDF, 256 KB |dead-url=yes |archive-url=https://web.archive.org/web/20150318032101/http://kikaku.itscj.ipsj.or.jp/ISO-IR/178.pdf |archive-date=2015-03-18}}
{{cite web |author-first=Roman |author-last=Czyborra |title=Unicode Transformation Formats: UTF-8 & Co. |date=1998-11-30 |url=http://czyborra.com/utf/#UTF-1 |access-date=2016-06-07 |dead-url=no |archive-url=https://web.archive.org/web/20160607111732/http://czyborra.com/utf/#UTF-1 |archive-date=2016-06-07}}
