UTF-7
Standard: RFC 2152
Language: International
Encodes: Unicode
Preceded by: HZ-GB-2312
Succeeded by: UTF-8 over 8BITMIME
Classification: Unicode Transformation Format, ASCII armor, variable-width encoding, stateful encoding

UTF-7 (7-bit Unicode Transformation Format) is a variable-length character encoding that was proposed for representing Unicode text as a stream of ASCII characters. It was originally intended as a way to encode Unicode text in Internet e-mail messages more efficiently than the combination of UTF-8 with quoted-printable.

As of December 2018, UTF-7 is used by less than 0.003% of websites.[1] Since 2009, UTF-8 has been the dominant encoding of any kind (not just among Unicode encodings) on the World Wide Web, and WHATWG declares it mandatory "for all things".[2]

Motivation

MIME, the modern standard for e-mail format, forbids encoding headers with byte values above the ASCII range. Although MIME allows the message body to be encoded in various character sets broader than ASCII, the underlying transmission infrastructure (SMTP, the main e-mail transfer standard) is still not guaranteed to be 8-bit clean. Therefore, a non-trivial content transfer encoding has to be applied in case of doubt. Unfortunately, base64 makes even US-ASCII characters unreadable in non-MIME clients. On the other hand, UTF-8 combined with quoted-printable is very size-inefficient, requiring 6 to 9 bytes for each non-ASCII character from the BMP and 12 bytes for each character outside the BMP.

Provided certain rules are followed during encoding, UTF-7 can be sent in e-mail without an underlying MIME transfer encoding, but it must still be explicitly identified as the text character set. In addition, if used within e-mail headers such as "Subject:", UTF-7 must be contained in MIME encoded words identifying the character set.
Since encoded words force the use of either quoted-printable or base64, UTF-7 was designed to avoid the = sign as an escape character, so that no double escaping occurs when it is combined with quoted-printable (or its variant, the RFC 2047/1522 ?Q?-encoding of headers).

UTF-7 is generally not used as a native representation within applications, as it is very awkward to process. Despite its size advantage over the combination of UTF-8 with either quoted-printable or base64, the now-defunct Internet Mail Consortium recommended against its use.[3]

8BITMIME has also been introduced, which reduces the need to encode message bodies in a 7-bit format.

A modified form of UTF-7 (sometimes dubbed "mUTF-7" [citation needed]) is currently used in the IMAP e-mail retrieval protocol for mailbox names.[4]

Description

UTF-7 was first proposed as an experimental protocol in RFC 1642, A Mail-Safe Transformation Format of Unicode. That RFC was made obsolete by RFC 2152, an informational RFC that never became a standard. As RFC 2152 clearly states, the RFC "does not specify an Internet standard of any kind". Despite this, RFC 2152 is quoted as the definition of UTF-7 in the IANA's list of charsets. Nor is UTF-7 a Unicode standard: The Unicode Standard 5.0 lists only UTF-8, UTF-16 and UTF-32. There is also a modified version, specified in RFC 2060, which is sometimes identified as UTF-7.

Some characters can be represented directly as single ASCII bytes. The first group is known as "direct characters" and contains 62 alphanumeric characters and 9 symbols: ' ( ) , - . / : ?

Space, tab, carriage return and line feed may also be represented directly as single ASCII bytes. However, if the encoded text is to be used in e-mail, care is needed to ensure that these characters are used in ways that do not require further content transfer encoding to be suitable for e-mail.
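As a rough illustration of the character classes described above, the following Python sketch checks whether a character may be written as itself. The set contents follow the "Set D" characters of RFC 2152 plus the four whitespace characters; the names DIRECT_CHARS, WHITESPACE and is_direct are hypothetical and exist only for this example.

```python
import string

# Direct characters: 62 alphanumerics plus 9 symbols (RFC 2152, "Set D").
DIRECT_CHARS = set(string.ascii_letters + string.digits + "'(),-./:?")

# Whitespace that may also be represented as single ASCII bytes.
WHITESPACE = set(" \t\r\n")


def is_direct(ch: str) -> bool:
    """Return True if the single character ch may appear as itself."""
    return ch in DIRECT_CHARS or ch in WHITESPACE


if __name__ == "__main__":
    for ch in ["A", "7", "?", "\t", "+", "£"]:
        print(repr(ch), is_direct(ch))
```

Note that the plus sign is deliberately absent from the set, since it introduces an encoded block.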
The plus sign (+) may be encoded as +-.

Other characters must be encoded in UTF-16 (hence U+10000 and higher are encoded as surrogate pairs), big-endian (hence higher-order bits appear first), and then in modified Base64. The start of each block of modified Base64 encoded UTF-16 is indicated by a + sign.

Examples
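A few encoded strings can be checked with Python's built-in "utf-7" codec. This is only a sketch: the sample strings in EXAMPLES were chosen for illustration, and the list name itself is hypothetical. The last entry shows that plain ASCII characters such as < and > can also be written inside encoded blocks.

```python
# Each pair is (UTF-7 bytes, expected decoded text).
EXAMPLES = [
    (b"Hello, World!", "Hello, World!"),       # direct characters only
    (b"1 +- 1 +AD0- 2", "1 + 1 = 2"),          # "+-" is a literal plus sign
    (b"+AKM-1", "£1"),                         # "-" closes the block before "1"
    (b"Hi Mom -+Jjo--!", "Hi Mom -☺-!"),       # the first "-" after the block is consumed
    (b"+ZeVnLIqe-", "日本語"),                  # three UTF-16 code units in one block
    (b"+ADw-script+AD4-", "<script>"),         # ASCII written as encoded blocks
]

if __name__ == "__main__":
    for raw, expected in EXAMPLES:
        decoded = raw.decode("utf-7")
        print(raw, "->", decoded, "(ok)" if decoded == expected else "(mismatch)")
```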
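Algorithm for encoding and decoding

Encoding

First, an encoder must decide which characters to represent directly in ASCII form, which + signs have to be escaped as +-, and which characters to place in blocks of Unicode characters.

Using the £† (U+00A3 U+2020) character sequence as an example:

1. Express the characters' Unicode numbers (UTF-16) in binary:
   0x00A3 → 0000 0000 1010 0011
   0x2020 → 0010 0000 0010 0000
2. Concatenate the binary sequences:
   0000 0000 1010 0011 and 0010 0000 0010 0000 → 0000 0000 1010 0011 0010 0000 0010 0000
3. Regroup the binary into groups of six bits, starting from the left:
   0000 0000 1010 0011 0010 0000 0010 0000 → 000000 001010 001100 100000 001000 00
4. If the last group has fewer than six bits, add trailing zeros:
   000000 001010 001100 100000 001000 00 → 000000 001010 001100 100000 001000 000000
5. Replace each group of six bits with the corresponding Base64 code:
   000000 001010 001100 100000 001000 000000 → AKMgIA

Decoding

First, the encoded data must be separated into plain ASCII text chunks (including + signs followed by a dash) and nonempty Unicode blocks, as mentioned in the description section. Once this is done, each Unicode block must be decoded by reversing the encoding procedure (using the result of the encoding example above as our example):

1. Express each Base64 code as the six-bit sequence it represents:
   AKMgIA → 000000 001010 001100 100000 001000 000000
2. Regroup the binary into groups of sixteen bits, starting from the left:
   000000 001010 001100 100000 001000 000000 → 0000000010100011 0010000000100000 0000
3. Discard the incomplete group at the end, which holds only the zero padding added during encoding:
   0000000010100011 0010000000100000 0000 → 0000000010100011 0010000000100000
4. Read each group of sixteen bits as a UTF-16 code unit:
   0000000010100011 → 0x00A3 (£)
   0010000000100000 → 0x2020 (†)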
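The encoding and decoding of a single Unicode block can be sketched in Python as follows. This is a minimal illustration that manipulates bit strings directly, mirroring the worked £† example, rather than an efficient implementation. The names B64_ALPHABET, encode_block and decode_block are hypothetical, and the functions handle only the Base64 body of a block, not the surrounding + and - delimiters or directly encoded characters.

```python
# Standard Base64 alphabet; modified Base64 uses it without "=" padding.
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def encode_block(text: str) -> str:
    """Encode text as the Base64 body of one UTF-7 block."""
    data = text.encode("utf-16-be")                   # big-endian UTF-16
    bits = "".join(f"{byte:08b}" for byte in data)    # concatenate the bits
    bits += "0" * (-len(bits) % 6)                    # pad the last group with zeros
    groups = (bits[i:i + 6] for i in range(0, len(bits), 6))
    return "".join(B64_ALPHABET[int(group, 2)] for group in groups)


def decode_block(block: str) -> str:
    """Decode the Base64 body of one UTF-7 block back to text."""
    # str.index raises ValueError for characters outside the alphabet.
    bits = "".join(f"{B64_ALPHABET.index(ch):06b}" for ch in block)
    usable = len(bits) - len(bits) % 16               # whole 16-bit code units
    if "1" in bits[usable:]:
        raise ValueError("non-zero padding bits in block")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-16-be")


if __name__ == "__main__":
    body = encode_block("£†")
    print(body)                 # Base64 body of the block
    print(decode_block(body))   # round trip back to the original text
```

Wrapping the result as "+" + body + "-" gives a complete block that Python's built-in "utf-7" codec should also accept, which is a convenient cross-check.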
Unicode signature

A Unicode signature (often loosely called a "BOM") is an optional special byte sequence at the very start of a stream or file that, without being data itself, indicates the encoding used for the data that follows; a signature is used in the absence of metadata that denotes the encoding. For a given encoding scheme, the signature is that scheme's representation of Unicode code point U+FEFF.

While a Unicode signature is typically a single, fixed byte sequence, the nature of UTF-7 necessitates 5 variations: +/v8, +/v9, +/v+, +/v/ and +/v8-. The last 2 bits of the 4th byte of the UTF-7 encoding of U+FEFF belong to the following character, which gives 4 possible values for that byte; the 5th variation covers the case where no character follows the signature.

Security

UTF-7 allows multiple representations of the same source string. In particular, ASCII characters can be represented as part of Unicode blocks. As a result, if standard ASCII-based escaping or validation is applied to strings that may later be interpreted as UTF-7, Unicode blocks can be used to slip malicious strings past it. To mitigate this problem, systems should perform decoding before validation and should avoid attempting to autodetect UTF-7.

Older versions of Internet Explorer can be tricked into interpreting a page as UTF-7. This can be used for a cross-site scripting attack, as the < and > characters can be encoded as +ADw- and +AD4- in UTF-7, which most validators let through as simple text.[5]

References

1. "Usage Statistics of UTF-7 for Websites, December 2018". w3techs.com. https://w3techs.com/technologies/details/en-utf7/all/all. Retrieved 2018-12-03.
2. "Encoding Standard". encoding.spec.whatwg.org. https://encoding.spec.whatwg.org/#security-background. Retrieved 2018-11-15. Quote: "The problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things."
3. "Using International Characters in Internet Mail". Internet Mail Consortium. 1 August 1998. https://www.imc.org/imcr-010.html. Archived at https://web.archive.org/web/20150907234243/https://www.imc.org/imcr-010.html on 2015-09-07.
4. RFC 3501 section 5.1.3.
5. "ArticleUtf7 - doctype-mirror - UTF-7: the case of the missing charset - Mirror of Google Doctype - Google Project Hosting". Code.google.com. 2011-10-14. https://code.google.com/p/doctype-mirror/wiki/ArticleUtf7. Retrieved 2012-06-29.