Escape sequences in C


Escape sequences are used in the programming languages C and C++, and their design was copied in many other languages such as Java and C#. An escape sequence is a sequence of characters that does not represent itself when used inside a character or string literal, but is translated into another character or a sequence of characters that may be difficult or impossible to represent directly.

In C, all escape sequences consist of two or more characters, the first of which is the backslash, {{mono|\}} (called the "escape character"); the remaining characters determine the interpretation of the escape sequence. For example, {{mono|\n}} is an escape sequence that denotes a newline character.

Motivation

Suppose we want to print out {{mono|Hello,}} on one line, followed by {{mono|world!}} on the next line. One could attempt to represent the string to be printed as a single literal as follows:

#include <stdio.h>

int main(void) {
    /* Invalid: the string literal is broken across two source lines. */
    printf("Hello,
world!");
}

This is not valid in C, since a string literal may not span multiple logical source lines. One workaround is to print the newline character using its numerical value ({{mono|0x0a}} in ASCII):

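#include <stdio.h>

int main(void) {
    /* %c prints the character whose code is 0x0a (newline in ASCII). */
    printf("Hello,%cworld!", 0x0a);
}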

This instructs the program to print {{mono|Hello,}}, followed by the byte whose numerical value is {{mono|0x0a}}, followed by {{mono|world!}}. While this works when the machine uses the ASCII encoding, it fails on systems whose encoding assigns a different numerical value to the newline character. It is also not a good solution because it still does not allow a newline character to be represented inside a literal; instead, it relies on the semantics of {{mono|printf}}. To solve these problems and ensure maximum portability between systems, C interprets {{mono|\n}} inside a literal as a newline character, whatever that may be on the target system:

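#include <stdio.h>

int main(void) {
    /* \n inside the literal stands for the newline character. */
    printf("Hello,\nworld!");
}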

In this code, the escape sequence {{mono|\n}} does not stand for a backslash followed by the letter {{mono|n}}, because the backslash causes an "escape" from the normal way characters are interpreted by the compiler. After seeing the backslash, the compiler expects another character to complete the escape sequence, and then translates the escape sequence into the bytes it is intended to represent. Thus, {{mono|"Hello,\nworld!"}} represents a string with an embedded newline, regardless of whether it is used inside {{mono|printf}} or anywhere else.

This raises the issue of how to represent an actual backslash inside a literal. This is done by using the escape sequence {{mono|\\}}, as seen in the next section.

Some languages, such as Pascal, do not have escape sequences. Instead, a command that includes a newline is used ({{mono|writeln}} appends a newline, {{mono|write}} does not):

writeln('Hello');

write('world!');

Table of escape sequences

The following escape sequences are defined in standard C. This table also shows the values they map to in ASCII. However, these escape sequences can be used on any system with a C compiler, and may map to different values if the system does not use a character encoding based on ASCII.

{| class="wikitable"
|-
! Escape sequence !! Hex value in ASCII !! Character represented
|-
| {{mono|\a}} || 07 || Alert (Beep, Bell) (added in C89)[1]
|-
| {{mono|\b}} || 08 || Backspace
|-
| {{mono|\e}}{{Ref|Note2|note 2}} || 1B || Escape character
|-
| {{mono|\f}} || 0C || Form feed (page break)
|-
| {{mono|\n}} || 0A || Newline (Line Feed); see notes below
|-
| {{mono|\r}} || 0D || Carriage Return
|-
| {{mono|\t}} || 09 || Horizontal Tab
|-
| {{mono|\v}} || 0B || Vertical Tab
|-
| {{mono|\\}} || 5C || Backslash
|-
| {{mono|\'}} || 27 || Apostrophe or single quotation mark
|-
| {{mono|\"}} || 22 || Double quotation mark
|-
| {{mono|\?}} || 3F || Question mark (used to avoid trigraphs)
|-
| {{mono|\nnn}}{{Ref|Note1|note 1}} || any || The byte whose numerical value is given by nnn interpreted as an octal number
|-
| {{mono|\xhh…}} || any || The byte whose numerical value is given by hh… interpreted as a hexadecimal number
|-
| {{mono|\uhhhh}}{{Ref|Note3|note 3}} || none || Unicode code point below 10000 hexadecimal
|-
| {{mono|\Uhhhhhhhh}}{{Ref|Note4|note 4}} || none || Unicode code point, where each h is a hexadecimal digit
|}

Note 1. {{Note|Note1}}There may be one, two, or three octal digits n present; see the Notes section below.

Note 2. {{Note|Note2}}Common non-standard code; see the Notes section below.

Note 3. {{Note|Note3}}{{mono|\u}} takes 4 hexadecimal digits h; see the Notes section below.

Note 4. {{Note|Note4}}{{mono|\U}} takes 8 hexadecimal digits h; see the Notes section below.

Notes

Apart from universal character names, each escape sequence in the above table maps to a single byte in an ordinary (non-wide) literal, including {{mono|\n}}. This is despite the fact that the platform may use more than one byte to denote a newline, such as the DOS/Windows CR-LF sequence, {{mono|0x0d 0x0a}}. The translation from {{mono|0x0a}} to {{mono|0x0d 0x0a}} on DOS and Windows occurs when the byte is written out to a file or to the console through a text-mode stream, but {{mono|\n}} only creates a single byte within the memory of the program itself.

A hex escape sequence must have at least one hex digit following {{mono|\x}}, with no upper bound; it continues for as many hex digits as there are. Thus, for example, {{mono|\xABCDEFG}} denotes the byte with the numerical value ABCDEF<sub>16</sub>, followed by the letter {{mono|G}}, which is not a hex digit. However, if the resulting integer value is too large to fit in a single byte, the result is not portable: the C standard requires the value to be representable in the corresponding character type, so compilers diagnose the escape and may either reject it or truncate the value. Most platforms have 8-bit {{mono|char}} types, which limits a useful hex escape sequence to two hex digits. However, hex escape sequences longer than two hex digits might be useful inside a wide character or wide string literal (prefixed with {{mono|L}}):

char s1[] = "\x12"; // single char with value 0x12 (18 in decimal)
char s2[] = "\x1234"; // out of range for an 8-bit char: diagnosed by the compiler and not portable
wchar_t s3[] = L"\x1234"; // single wchar_t with value 0x1234, provided wchar_t is long enough (16 bits suffices)

An octal escape sequence consists of {{mono|\}} followed by one, two, or three octal digits. The octal escape sequence ends when it either contains three octal digits already, or the next character is not an octal digit. For example, {{mono|\11}} is a single octal escape sequence denoting a byte with numerical value 9 (11 in octal), rather than the escape sequence {{mono|\1}} followed by the digit {{mono|1}}. However, {{mono|\1111}} is the octal escape sequence {{mono|\111}} followed by the digit {{mono|1}}. In order to denote the byte with numerical value 1, followed by the digit {{mono|1}}, one could use {{mono|"\1""1"}}, since C automatically concatenates adjacent string literals. Note that some three-digit octal escape sequences, such as {{mono|\777}}, may be too large to fit in a single byte; as with hex escapes, such a value is not portable and is diagnosed by the compiler. The escape sequence {{mono|\0}} is a commonly used octal escape sequence, which denotes the null character, with value zero.

Non-standard escape sequences

A sequence such as {{Mono|\z}} is not a valid escape sequence according to the C standard, as it is not found in the table above. The C standard requires such "invalid" escape sequences to be diagnosed (i.e., the compiler must issue a diagnostic message). Notwithstanding this fact, some compilers may define additional escape sequences, with implementation-defined semantics. An example is the {{Mono|\e}} escape sequence, which has 1B as the hexadecimal value in ASCII, represents the escape character, and is supported in GCC,[2] clang and tcc. It was not, however, added to the C standard repertoire, because it has no meaningful equivalent in some character sets (such as EBCDIC).[1]

Universal character names

Since the C99 standard, C has also supported escape sequences that denote Unicode code points in string literals. Such escape sequences are called universal character names, and have the form {{mono|\uhhhh}} or {{mono|\Uhhhhhhhh}}, where {{mono|h}} stands for a hex digit. Unlike the other escape sequences considered, a universal character name may expand into more than one code unit.

The sequence {{mono|\uhhhh}} denotes the code point {{mono|hhhh}}, interpreted as a hexadecimal number. The sequence {{mono|\Uhhhhhhhh}} denotes the code point {{mono|hhhhhhhh}}, interpreted as a hexadecimal number. (Therefore, code points located at U+10000 or higher must be denoted with the {{mono|\U}} syntax, whereas lower code points may use {{mono|\u}} or {{mono|\U}}.) The code point is converted into a sequence of code units in the encoding of the destination type on the target system. For example, consider:

char s1[] = "\xC0";
char s2[] = "\u00C1";
wchar_t s3[] = L"\xC0";
wchar_t s4[] = L"\u00C0";

The string {{mono|s1}} will contain a single byte (not counting the terminating null) whose numerical value, the actual value stored in memory, is in fact {{mono|0xC0}}. The string {{mono|s2}} will contain the character "Á", U+00C1 {{sc2|LATIN CAPITAL LETTER A WITH ACUTE}}. On a system that uses the UTF-8 encoding, the string {{mono|s2}} will contain two bytes, {{mono|0xC3 0x81}}. The string {{mono|s3}} contains a single {{mono|wchar_t}}, again with numerical value {{mono|0xC0}}. The string {{mono|s4}} contains the character "À" encoded into {{mono|wchar_t}}; if the UTF-16 encoding is used, then {{mono|s4}} will also contain only a single {{mono|wchar_t}}, 16 bits long, with numerical value {{mono|0x00C0}}. A universal character name such as {{mono|\U0001F603}} may be represented by a single {{mono|wchar_t}} if the UTF-32 encoding is used, or two if UTF-16 is used.

Importantly, the universal character name {{mono|\u00C0}} always denotes the character "À", regardless of what kind of string literal it is used in, or the encoding in use. Again, {{mono|\U0001F603}} always denotes the character at code point U+1F603, regardless of context. On the other hand, octal and hex escape sequences always denote certain sequences of numerical values, regardless of encoding. Therefore, universal character names are complementary to octal and hex escape sequences; while octal and hex escape sequences represent "physical" code units, universal character names represent code points, which may be thought of as "logical" characters.

See also

  • Escape sequence
  • Digraph

References

  • ISO/IEC 9899:1999, Programming languages — C
  • Kernighan, Brian W.; Ritchie, Dennis M. (1988). The C Programming Language. Prentice Hall. ISBN 9780133086218.
  • Lafore, Robert (2001). Object-Oriented Programming in Turbo C++. Galgotia Publications. ISBN 9788185623221.
1. ^ "Rationale for International Standard - Programming Languages - C", version 5.10, April 2003. http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf. Archived at https://web.archive.org/web/20160606072228/http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf on 2016-06-06. Retrieved 2010-10-17.
2. ^ "6.35 The Character <ESC> in Constants". GCC 4.8.2 Manual. https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/Character-Escapes.html#Character-Escapes. Retrieved 2014-03-08.
