Unicode and escape sequences

Unicode

Computers work internally with bits, represented as either 1 or 0. Characters as used in texts must be "translated" into sequences of bits, and vice versa. This is called character encoding. Character encoding uses a certain standard "encoding table". The dominant standard for consistent character encoding is Unicode. Unicode defines several transformation formats, including UTF-8 and UTF-16. The first Unicode characters in the "encoding table" are ASCII characters: the characters on a standard USA keyboard. Unicode extends the set of ASCII massively, covering characters from most alphabets, symbols, emoji, etc.

In Unicode the sequence of bits (a sequence of zeros and ones) associated with a character is interpreted as a number. Such a number is called a code point. Code points are denoted as "U+" followed by the code point value represented as a hexadecimal number. For instance, U+03C9 is the code point for the Greek small letter omega (ω) and U+23F0 is the code point for an alarm clock character (⏰). Wikipedia provides a list of the most important character code points for English-language readers, as well as a list of emoji code points 😵.
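
The relation between a character and its code point can be inspected from JavaScript itself. The following is a minimal sketch using the built-in codePointAt() and String.fromCodePoint(); the variable names are only illustrative.

```javascript
// Look up the code point of a character and format it in "U+" notation.
const omega = "ω";
const omegaCodePoint = omega.codePointAt(0);
const omegaLabel = "U+" + omegaCodePoint.toString(16).toUpperCase().padStart(4, "0");

console.log(omegaCodePoint); // logs: 969
console.log(omegaLabel); // logs: "U+03C9"

// Go the other way: from a code point value back to the character.
const alarmClock = String.fromCodePoint(0x23F0);
console.log(alarmClock); // logs: "⏰"
```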

In JavaScript, Unicode characters can be included in a string by simply typing them on a keyboard, copy-pasting them from another text, using a special character picker (⊞ Win+. or Control ⌃+Command ⌘+Spacebar), etc. They can also be included by using escape sequences. The general escape sequence for Unicode code points in JavaScript is: \u{X...XXXXXX} where X...XXXXXX are 1 to 6 hexadecimal digits that form the code point value.


console.log("\u{1F4A9}\u{0020}\u{0069}\u{0073}\u{0020}\u{0061}\u{0020}\u{0070}\u{0069}\u{006C}\u{0065}\u{0020}\u{006F}\u{0066}\u{0020}\u{0070}\u{006F}\u{006F}\u{002E}");

// logs: "💩 is a pile of poo."

JavaScript also provides shorter escape sequences for subsets of Unicode code points:

Format \uXXXX with exactly 4 hexadecimal digits. All code points within plane 0, aka the Basic Multilingual Plane (BMP), can be represented by this four-digit format. The BMP contains characters for almost all modern languages.
Format \xXX with exactly 2 hexadecimal digits. All code points from U+0000 to U+00FF can be represented by this format. The first half of that range (U+0000 – U+007F) holds the ASCII code points, which include the characters on a standard US keyboard. The second half adds a number of common characters (currency signs, fraction characters, punctuation characters, etc.) and a number of characters with diacritics or ligatures (ä, è, ç, Æ, ß, etc.).

Escape sequences \u{0000A3}, \u00A3 and \xA3 all represent the same code point.
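
As a sketch of how the three formats relate, the hypothetical helper shortestEscape() below (not a built-in, just an illustration) picks the shortest escape format that can hold a given code point, based on the ranges described above.

```javascript
// The three escape formats for U+00A3 produce the very same string.
const poundForms = ["\u{0000A3}", "\u00A3", "\xA3"];
const allSame = poundForms.every((form) => form === "£");
console.log(allSame); // logs: true

// Hypothetical helper: build the shortest escape sequence for a code point.
function shortestEscape(codePoint) {
  const hex = codePoint.toString(16).toUpperCase();
  if (codePoint <= 0xFF) return "\\x" + hex.padStart(2, "0"); // U+0000 – U+00FF
  if (codePoint <= 0xFFFF) return "\\u" + hex.padStart(4, "0"); // rest of the BMP
  return "\\u{" + hex + "}"; // supplementary planes
}

console.log(shortestEscape(0xA3)); // logs: "\xA3"
console.log(shortestEscape(0x3C9)); // logs: "\u03C9"
console.log(shortestEscape(0x1F4A9)); // logs: "\u{1F4A9}"
```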


console.log("\u23F0 is an alarm clock and \u03C9 is the Greek Small letter Omega.");
// logs: "⏰ is an alarm clock and ω is the Greek Small letter Omega."

console.log("\xA5 is the currency sign for the Japanese yen and \xA3 is the currency sign for the British pound.");
// logs: "¥ is the currency sign for the Japanese yen and £ is the currency sign for the British pound."

Code point vs glyph

It is not necessarily the case that all of the code points will appear on your screen. It depends on the availability of the glyph that represents a code point in the used typeface or font.

A code point describes a grapheme (a character). A glyph is a specific design that represents a particular grapheme, e.g. the letter "a". The glyph design is part of an overall typeface (font family) design. Most typefaces include variations in, for instance, weight (e.g. bold) or slope (e.g. italic). Each of these variations of the typeface is a font. The appearance, the glyph, of grapheme "a" varies from typeface to typeface and from font to font.

It is very well possible that a typeface or font misses glyphs for a number of code points. Using a code point in combination with a typeface or font that misses a glyph for this character will produce an empty space or some standard placeholder symbol.

In addition, emoji and other picture characters are often not drawn from the typeface you selected, but from a separate emoji font supplied by the platform or application. Whether these code points appear (correctly) on your screen depends on the level of emoji implementation in your application. In the context of client-side JavaScript, it depends on the level of implementation in browsers. The appearance of the emoji may also vary from browser to browser.

Whitespace and line terminators

Characters that cannot be printed, such as a tab, a backspace or a line feed, also have Unicode code points. In many formal languages, including regular expressions and JavaScript, these characters can be denoted by shorthand single character escape sequences.

In the next code sample a tab Unicode escape sequence (\u{0009}), a tab shorthand single character escape sequence (\t) and a tab by simply typing a tab via the Tab key on the keyboard are included in a string. All three do the same thing.


console.log("Next character\u{0009}is a tab."); // logs: "Next character	is a tab."
console.log("Next character\tis a tab."); // logs: "Next character	is a tab."
console.log("Next character	is a tab."); // logs: "Next character	is a tab."
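
The same holds for the other shorthand escape sequences. As a small sketch, the loop below lists the code point behind each of them; the names shorthandEscapes and shorthandTable are only illustrative.

```javascript
// Map the source text of each shorthand escape to the character it produces.
const shorthandEscapes = {
  "\\b": "\b", // backspace
  "\\t": "\t", // horizontal tab
  "\\n": "\n", // line feed
  "\\v": "\v", // vertical tab
  "\\f": "\f", // form feed
  "\\r": "\r", // carriage return
};

const shorthandTable = Object.entries(shorthandEscapes).map(([escape, character]) => {
  const hex = character.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  return `${escape} is U+${hex}`;
});

console.log(shorthandTable.join(", "));
// logs: "\b is U+0008, \t is U+0009, \n is U+000A, \v is U+000B, \f is U+000C, \r is U+000D"
```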

See this list of escape sequences in JavaScript.

The "line terminators" carriage return, line feed and newline are closely associated:

Carriage return (U+000D):: Escape sequence: \r.
Return the cursor to the beginning of the current line without advancing downward.
Line feed (U+000A):: Escape sequence: \n.
Advance the cursor downward to the next line without returning to the beginning of the line.
Newline:: Escape sequence: \r\n.
A "next line" or a "line break": a carriage return and a line feed combined. Equivalent to pressing the ↵ Enter key.

Although historically line terminators are defined as listed above, various programming languages treat them in various ways. JavaScript treats \n (or \u{000A}) as a newline instead of a line feed. In JavaScript escape sequence \r\n serves as a redundant carriage return and a newline.
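
Because text from different sources may use either \n or \r\n, code that splits text into lines often has to accept both. The following is a minimal sketch; splitLines() is a hypothetical helper, not a built-in.

```javascript
const crlfText = "line 1\r\nline 2"; // carriage return + line feed
const lfText = "line 1\nline 2"; // line feed only

// "\r\n" consists of two separate characters.
console.log("\r\n".length); // logs: 2

// Splitting on "\n" alone leaves the carriage return behind.
const naiveSplit = crlfText.split("\n");
console.log(naiveSplit[0] === "line 1\r"); // logs: true

// Hypothetical helper: try "\r\n" first, so it counts as one line break.
const splitLines = (text) => text.split(/\r\n|\r|\n/);
console.log(splitLines(crlfText)); // logs: [ "line 1", "line 2" ]
console.log(splitLines(lfText)); // logs: [ "line 1", "line 2" ]
```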

In JavaScript, a newline is equivalent to pressing the ↵ Enter key, although in a string literal an unescaped newline throws a SyntaxError. Use the escape sequence \n to insert an actual newline in a string literal.


console.log("This is line 1
and this is line 2");
  // logs: SyntaxError: "" string literal contains an unescaped line break


console.log("This is line 1 \nand this is line 2");
  // logs:
  // "This is line 1 
  //  and this is line 2"

In a template literal an unescaped newline (↵ Enter key) is allowed and inserts an actual newline.


console.log(`This is line 1
and this is line 2`);

  // logs:
  // "This is line 1 
  //  and this is line 2"


console.log(`
` === "\n"); // logs: true

console.log(`
` === "\r"); // logs: false

console.log(`
` === "\r\n"); // logs: false

Escaping single characters

Quotation marks (", ') have special meaning in a string literal. They are not part of the actual string characters. They begin and end a string literal. The same thing holds for backslashes (\); they begin an escape sequence. They are not visible as such in the returned string. Escaping quotation marks (\', \") or the backslash (\\) turns them into string characters: the escaped character is taken literally as the character itself. This way quotation marks can be used in a string without ending the string, and a literal backslash can be added to a string, e.g. to show an escape sequence.


console.log("This is what they call a \"string\", right?"); // logs: "This is what they call a "string", right?"
console.log('This is what they call a \'string\', right?'); // logs: "This is what they call a 'string', right?"

console.log("\\u{1F4A9} is the escape sequence for a pile of poo."); // logs: "\u{1F4A9} is the escape sequence for a pile of poo."

So, some characters escaped with a backslash are interpreted as escape sequences (\n, \t, \u, \x etc.). Any other character escaped with a backslash is interpreted as literally just that single character.


console.log("\H\e\l\l\o" === "Hello"); // logs: true
console.log("'" === "\'"); // logs: true
console.log("\Y\o\u"); // logs: SyntaxError: malformed Unicode character escape sequence // \u starts an escape sequence.
console.log("\Y\o\\u"); // logs: "Yo\u"

An escaped zero (\0), when not followed by a digit between 0 and 7, is interpreted as the null character (U+0000). In JavaScript this character (\0 or \u{0000}) has no special meaning: it is an ordinary character in a string. In other languages it sometimes ends a string value.
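
A short sketch to illustrate that the null character is just another character in a JavaScript string (the variable name withNull is only illustrative):

```javascript
const withNull = "a\0b";

console.log(withNull.length); // logs: 3
console.log(withNull.codePointAt(1)); // logs: 0
console.log("\0" === "\u{0000}"); // logs: true

// The string does not end at the null character.
console.log(withNull.endsWith("b")); // logs: true
```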

Note that HTML and CSS each use their own different "escape sequence syntax" to include special Unicode characters. And HTML has its own (non-Unicode) representations for line termination and paragraph control.
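
To give an impression of the differences, the sketch below builds the escape notation for one code point in each of the three languages: a JavaScript \u{...} escape, an HTML hexadecimal numeric character reference (&#x...;) and a CSS escape (a backslash followed by hexadecimal digits and a terminating space). The three helper functions are hypothetical and serve only as an illustration.

```javascript
const toHex = (codePoint) => codePoint.toString(16).toUpperCase();

// Hypothetical helpers that return the escape notation as plain text.
const jsEscape = (codePoint) => "\\u{" + toHex(codePoint) + "}";
const htmlReference = (codePoint) => "&#x" + toHex(codePoint) + ";";
const cssEscape = (codePoint) => "\\" + toHex(codePoint) + " ";

console.log(jsEscape(0x1F4A9)); // logs: "\u{1F4A9}"
console.log(htmlReference(0x1F4A9)); // logs: "&#x1F4A9;"
console.log(cssEscape(0x1F4A9)); // logs: "\1F4A9 "
```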

When to use escape sequences?

When you write code (HTML, CSS, JavaScript), set your text editor to encode files in UTF-8, by far the most common encoding on the World Wide Web (> 98% of all web pages, as of 2021). UTF-8 can encode all Unicode characters, so generally you can insert any Unicode character directly in your code. In string literals only line breaks, backslashes and the quotation mark that delimits the string must be escaped (see above). You can nevertheless use Unicode escape sequences to insert any character. Keep in mind, though, that escape sequences can make code more difficult to read and maintain, and generally also increase the file size (they take more characters).


console.log("\x48\x65\x6C\x6C\x6F" === "Hello"); // logs: true

console.log("ω" === "\u03C9"); // logs: true
console.log("⏰ is an alarm clock and 💩 is a pile of poo."); // logs: "⏰ is an alarm clock and 💩 is a pile of poo."
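
The remark about file size can be checked with a rough sketch. It assumes the source file is saved as UTF-8 and uses the built-in TextEncoder to count bytes; utf8Bytes() is a hypothetical helper.

```javascript
// Hypothetical helper: the number of bytes a text takes when saved as UTF-8.
const utf8Bytes = (text) => new TextEncoder().encode(text).length;

// The source text of two string literals that produce the same string "ω".
const directSource = '"ω"';
const escapedSource = '"\\u03C9"';

console.log(utf8Bytes(directSource)); // logs: 4
console.log(utf8Bytes(escapedSource)); // logs: 8

// The same comparison for a character from a supplementary plane.
console.log(utf8Bytes('"💩"')); // logs: 6
console.log(utf8Bytes('"\\u{1F4A9}"')); // logs: 11
```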

Surrogate pairs

As mentioned before, all characters in plane 0, aka the Basic Multilingual Plane (BMP), can be escaped by using the format \uXXXX. They use a 16-bit representation, which is called a code unit. Historically in JavaScript every single code unit of 16 bits within a string represents one character.

But characters in the "higher" planes of Unicode ("supplementary planes"), including mathematical symbols and emoji, have a code point value of 5 or 6 hexadecimal digits (U+10000 and up). They need more than 16 bits, up to 21 bits! To circumvent this, Unicode provides code point pairs within plane 0 to form references to characters of higher planes. These code point pairs are called surrogate pairs. A code point from the range U+D800–U+DBFF combined with one from the range U+DC00–U+DFFF addresses a character from a supplementary plane. The surrogate code points on their own have no meaning. Note that you can (and should) use the supplementary plane code points directly (using the \u{XXXXXX} format): surrogate pairs are used "under the hood".
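
How a supplementary plane code point maps to its surrogate pair can be sketched with a little arithmetic: subtract 0x10000, put the upper 10 bits of the result in the first surrogate and the lower 10 bits in the second. The function toSurrogatePair() is a hypothetical helper for illustration; JavaScript does this for you.

```javascript
// Hypothetical helper: split a supplementary plane code point into two surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // a value of at most 20 bits
  const high = 0xD800 + (offset >> 10); // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [high, low];
}

const [highSurrogate, lowSurrogate] = toSurrogatePair(0x1F600);
console.log(highSurrogate.toString(16).toUpperCase()); // logs: "D83D"
console.log(lowSurrogate.toString(16).toUpperCase()); // logs: "DE00"

// The two code units together form the same string as the code point escape.
console.log(String.fromCharCode(highSurrogate, lowSurrogate) === "\u{1F600}"); // logs: true
```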

A consequence is that these supplementary plane characters are represented by two code units, while JavaScript treats every code unit as one character in a string. This means, for example, that properties and methods of the string wrapper object (see next chapter), such as length, slice() or substring(), may not work as expected when the string contains characters from supplementary planes.


console.log('\u{1F600}' === '😀'); // logs: true

console.log('😀' === '\uD83D\uDE00'); //  logs: true // two surrogate code points.   
console.log('😀'.length); // logs: 2  // One character but a length of 2...

let str = "😀 is a smiley";
console.log(str[0]); // logs: "�" // unprintable code unit
console.log(str[1]); // logs: "�" // unprintable code unit
console.log(str[3]); // logs: "i"
console.log(str[4]); // logs: "s"
// etc.

As a work-around we can use the Unicode-aware Array.from() method to create an array with each character (and not each code unit) as one element.


let smiley = "\u{1F600}";
let smileyArray = Array.from(smiley);
console.log(smileyArray.length); // logs: 1
console.log(smileyArray[0]); // logs: "😀"
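
The spread syntax and the for...of loop use the same string iterator as Array.from(), so they also step through a string per code point instead of per code unit. A minimal sketch (the variable names are only illustrative):

```javascript
const text = "😀 is a smiley";

// Count code units versus code points.
const codePoints = [...text];
console.log(text.length); // logs: 14
console.log(codePoints.length); // logs: 13

// Take the first four characters without cutting a surrogate pair in half.
const firstFour = codePoints.slice(0, 4).join("");
console.log(firstFour); // logs: "😀 is"

// for...of visits the emoji as one element.
const visited = [];
for (const character of text) {
  visited.push(character);
}
console.log(visited[0]); // logs: "😀"
```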

Assembled characters 🤷🏾‍♀️

Unicode characters may also be assembled by putting together multiple code points:

Characters with diacritical marks (e.g. "ç" or "ä") may be constructed by a code point of the base character, directly followed by a diacritical mark code point, although most of these characters also have their own Unicode code point.
Symbols may be directly followed by a variation selector code point to "enhance" the symbol to an emoji appearance.
Emoji appearances may be changed by placing an emoji modifier code point right after the emoji, for instance, to change the skin tone of a face emoji.
Emoji may even be composed of more than two code points to form more extended emoji sequences.

All these Unicode sequences will be discussed in more detail in regular expressions. More sophisticated methods than Array.from() to distinguish the separate characters in a string will be covered there as well. Array.from() is only aware of surrogate pairs, not of assembled characters.


let str, strArray;

str = "a\u{0308}"; // "ä" built from a base character and a diacritical mark (There is also a code point for the whole character: "\u{00E4}")
strArray = Array.from(str);
console.log(str); // logs: "ä"
console.log(strArray); // logs: [ "a", "̈" ]
console.log(strArray.length); // logs: 2

str = "\u{1F467}+\u{1F3FF}→\u{1F467}\u{1F3FF}";
strArray = Array.from(str);
console.log(str); // logs: "👧+🏿→👧🏿"
console.log(strArray); // logs: [ "👧", "+", "🏿", "→", "👧", "🏿" ]
console.log(strArray.length); // logs: 6

str = "\u{1F468}+\u{1F469}+\u{1F467}→\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
strArray = Array.from(str);
console.log(str); // logs: "👨+👩+👧→👨‍👩‍👧"
console.log(strArray); // logs: [ "👨", "+", "👩", "+", "👧", "→", "👨", "‍", "👩", "‍", "👧" ]
console.log(strArray.length); // logs: 11
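
As a preview, here is a hedged sketch of two built-in tools that do look beyond single code points. normalize("NFC") merges a base character and a diacritical mark into one code point where such a code point exists, and Intl.Segmenter (assuming an environment that supports it, such as a recent browser or Node.js version) splits a string into user-perceived characters. The helper countGraphemes() is hypothetical.

```javascript
// Merge "a" + combining diaeresis into the single code point U+00E4.
const decomposed = "a\u{0308}";
const composed = decomposed.normalize("NFC");
console.log(composed === "\u{00E4}"); // logs: true
console.log(Array.from(composed).length); // logs: 1

// Hypothetical helper: count user-perceived characters (grapheme clusters).
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const countGraphemes = (text) => Array.from(segmenter.segment(text)).length;

console.log(countGraphemes(decomposed)); // logs: 1
console.log(countGraphemes("\u{1F467}\u{1F3FF}")); // logs: 1
console.log(countGraphemes("\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}")); // logs: 1
```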

BTW: Emoji sequences may not always display as one would expect because of possible incomplete implementation in browsers. The level of implementation varies from browser to browser as well as the way the emoji are presented.