Unicode escape sequences 💥️

The u flag and surrogate paired characters

ASCII-based characters have shorthand expressions like \w, which is equivalent to [A-Za-z0-9_]. To match other Unicode characters we can use Unicode escape sequences. If a search involves characters that are encoded as surrogate pairs, the u flag needs to be set. This makes the regex search Unicode-aware: unlike string indexing and .length, a regular expression with the u flag treats a 4-byte code point from the supplementary planes as one would expect, as a single character.


const text = "Angle α measures π radians.";
//const regex = /[\u0391-\u03A9\u03B1-\u03C9]/g; // this is equivalent to the next statement: 
const regex = /[Α-Ωα-ω]/g; // u flag not necessary 
console.log(text.match(regex)); // logs: [ "α", "π" ]

const moods = "happy 🙂, confused 😕, sad 😢";
//const regexpEmoticons = /[\u{1F600}-\u{1F644}]/gu; // this is equivalent to next statement: 
const regexpEmoticons = /[😀-🙄]/gu; // u flag is necessary (surrogate paired characters) 
console.log(moods.match(regexpEmoticons)); // logs: [ "🙂", "😕", "😢" ]

const regexpEmoticonsNoFlag = /[😀-🙄]/g; // no u flag: Uncaught SyntaxError: invalid range in character class
console.log(moods.match(regexpEmoticonsNoFlag)); // never runs, the regex literal above is already rejected

The u flag also needs to be set when a general pattern such as . may run into 4-byte supplementary plane code points, as in the example below. Without the u flag, the match returns the two surrogates of each pair separately.


const text = "👽😈🐷";
const regex1 = /./g;
console.log(text.match(regex1)); // logs: [ "\ud83d", "\udc7d", "\ud83d", "\ude08", "\ud83d", "\udc37" ]

const regex2 = /./gu;
console.log(text.match(regex2)); // logs: [ "👽", "😈", "🐷" ]
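
The same difference shows up outside .match(). The following is a minimal sketch (the variable name alien is only illustrative) that compares what the string itself counts with what a regex counts with and without the u flag:

```javascript
const alien = "\u{1F47D}"; // "👽", one code point stored as a surrogate pair

console.log(alien.length);         // logs: 2 // UTF-16 code units
console.log([...alien].length);    // logs: 1 // the string iterator walks code points

console.log(/^.$/.test(alien));    // logs: false // without u, "." matches a single code unit
console.log(/^.{2}$/.test(alien)); // logs: true  // the two surrogates, one by one
console.log(/^.$/u.test(alien));   // logs: true  // with u, "." matches the whole code point
```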

Unicode properties

The Unicode Standard assigns various properties to each Unicode code point. These properties describe the code point. For instance, one property describes to what general category a code point belongs, like letter, symbol, digit, punctuation, etc. Subcategories make this more specific, like lowercase letters, uppercase letters, math symbols, currency symbols, dash punctuation, opening bracket punctuation, etc. Another property describes to what alphabet, or more accurately, to what Script (Latin, Cyrillic, Hebrew, Arabic, Han, Braille, etc.) a symbol belongs. Code points have many properties.

Unicode properties are very useful in regular expressions. Character classes like \w or \d only match ASCII word characters ([A-Za-z0-9_]) or only ASCII decimal digits ([0-9]). With Unicode properties you can match much broader categories, or categories that are not covered by the standard character classes. For instance, you can search for letters from all languages, or only for characters from the Cyrillic script.
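
As a quick illustration of that difference, here is a small sketch with an assumed sample string that mixes Cyrillic, Latin, Devanagari digits and ASCII digits. The property names used are explained further on in this section.

```javascript
const sample = "Привет, world! १२३ 456";

console.log(sample.match(/\p{L}+/gu));               // logs: [ "Привет", "world" ]
console.log(sample.match(/\p{Script=Cyrillic}+/gu)); // logs: [ "Привет" ]
console.log(sample.match(/\d+/g));                   // logs: [ "456" ] // ASCII digits only
console.log(sample.match(/\p{Nd}+/gu));              // logs: [ "१२३", "456" ] // decimal digits of any script
```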

To specify a Unicode property within a regex we use the Unicode property escapes \p{…} and \P{…}, where \P (uppercase P) is the negation of \p (lowercase p). The dots (…) represent some property or property value. When Unicode properties are included in a regex, the u flag needs to be present.
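
What happens when the u flag is left out can be surprising: the pattern is still accepted, but it means something else. A small sketch of this, in which NotAProperty is a made-up name used only to provoke the error:

```javascript
// Without the u flag, \p is just an escaped "p" and {L} is literal text.
console.log(/\p{L}/.test("a"));    // logs: false
console.log(/\p{L}/.test("p{L}")); // logs: true

// With the u flag, \p{L} is a Unicode property escape.
console.log(/\p{L}/u.test("a"));   // logs: true

// With the u flag, an unknown property name is rejected as a SyntaxError.
let errorName = "none";
try {
  new RegExp("\\p{NotAProperty}", "u"); // hypothetical, non-existent property
} catch (error) {
  errorName = error.name;
}
console.log(errorName); // logs: "SyntaxError"
```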

Unicode property values are either binary (true or false) or non-binary. With binary properties only the property name must be included to indicate the value "true". Non-binary values are specified with both name and value, although for values of General_Category (see below) the name may also be omitted. Both names and values may have (multiple) aliases or shorthands that can be used.

// Non-binary values:
\p{UnicodePropertyName=UnicodePropertyValue}
\p{UnicodePropertyValue} // alternative for General_Category values

// Binary values:
\p{UnicodeBinaryPropertyName}

// Negation:
\P{UnicodePropertyValue} // for General_Category values
\P{UnicodeBinaryPropertyName}
UnicodePropertyName:
There are only 3 non-binary property names:
  • General_Category (alias gc)
  • Script (alias sc)
  • Script_Extensions (alias scx)
UnicodeBinaryPropertyName:
There are a lot of binary property names. See this list of binary property names and their aliases.
UnicodePropertyValue:
Lists of non-binary property values and their aliases:

In Unicode a "Script" is a collection of letters, numerals and other characters used to write text in one or more languages. English and French for example use the Latin script, and Chinese and Japanese use the Han script (Hanzi, Kanji). Scripts may include not only letters, but also numerals or specific (diacritic) marks and punctuation. General unified diacritical characters and general unified punctuation characters have the Common or Inherited script property value.
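
The difference between Script and Script_Extensions is easiest to see with a character that is shared between scripts. The sketch below uses the Arabic-Indic digit two (U+0662), which is also used in Thaana text. The exact results depend on the Unicode data shipped with the JavaScript engine, so treat the logged values as what a current engine is expected to print.

```javascript
const arabicIndicTwo = "\u0662"; // "٢", ARABIC-INDIC DIGIT TWO

// Script assigns every code point to exactly one script.
console.log(/\p{Script=Thaana}/u.test(arabicIndicTwo));            // logs: false

// Script_Extensions lists every script the character is commonly used in.
console.log(/\p{Script_Extensions=Thaana}/u.test(arabicIndicTwo)); // logs: true
console.log(/\p{scx=Arabic}/u.test(arabicIndicTwo));               // logs: true
```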

Some examples:


const text = "'你好' means 'Hello'.";

const regex1 = /\p{General_Category=Letter}/gu;
console.log(text.match(regex1)); // logs: [ "你", "好", "m", "e", "a", "n", "s", "H", "e", "l", … ]

const regex2 = /\p{gc=Letter}/gu;
console.log(text.match(regex2)); // logs: [ "你", "好", "m", "e", "a", "n", "s", "H", "e", "l", … ]

// It is not mandatory to use the property name for General_Category
const regex3 = /\p{Letter}/gu;
console.log(text.match(regex3)); // logs: [ "你", "好", "m", "e", "a", "n", "s", "H", "e", "l", … ]

// With alias:
const regex4 = /\p{L}/gu;
console.log(text.match(regex4)); // logs: [ "你", "好", "m", "e", "a", "n", "s", "H", "e", "l", … ]

// With negation:
const regex5 = /\P{L}/gu;
console.log(text.match(regex5)); // logs: [ "'", "'", " ", " ", "'", "'", "." ] // anything but letters, like punctuations and spaces.
                                 // equivalent to /[^\p{L}]/gu

const text = "The quadratic formula x = (−b±√(b²−4ac))/2a";
const regex = /[\p{Sm}\p{Ps}\p{Pe}\p{Po}]/gu; // Values for General_Category: math symbols, opening bracket punctuation, closing bracket punctuation, other punctuation.
console.log(text.match(regex)); // logs: [ "=", "(", "−", "±", "√", "(", "−", ")", ")", "/" ]

const text = "'έφαγα τον κόσμο να σε βρω!' is some strange Greek idiom.";
const regex = /\p{sc=Greek}/gu; // Script Greek
console.log(text.match(regex)); // logs: [ "έ","φ","α","γ","α","τ","ο","ν","κ","ό","σ","μ","ο","ν","α","σ","ε","β","ρ","ω" ]

const text = "Three wise monkeys 🙈🙉🙊 turning a blind eye to evil👹🦉🐒😕.";
const regex = /\p{Emoji}/gu; // Binary property name "Emoji"
console.log(text.match(regex)); // logs: [ "🙈", "🙉", "🙊", "👹", "🦉", "🐒", "😕" ]

Combining diacritical marks

Most characters with a diacritical mark are available as single code points. But in Unicode, letters and symbols can also be composed by adding a diacritic to a basic character. Unicode provides combining diacritical code points for this purpose (see Combining Diacritical Marks and Combining Diacritical Marks for Symbols). Like Unicode surrogate pairs, combining diacritics may lead to unexpected results.


const text1 = "gar\u00E7on"; // "garçon" with "ç" (c-trencada or c-cedilla) as one code point U+00E7.
const text2 = "garc\u0327on"; // "garçon" with "ç" as a "c" and a combining diacritical cedilla mark U+0327.

console.log(`text2 counts ${text2.length} code points.`); // logs: "text2 counts 7 code points." // although "garçon" has 6 characters.
console.log(Array.from(text2)); // logs: [ "g", "a", "r", "c", "̧", "o", "n" ] // took "c" and the cedilla separately

console.log(text1.match(/\w/g)); // [ "g", "a", "r", "o", "n" ] // The "ç" is missing

console.log(text1.match(/\p{L}/gu)); // [ "g", "a", "r", "ç", "o", "n" ]
console.log(text2.match(/\p{L}/gu)); // [ "g", "a", "r", "c", "o", "n" ] // matched "c" without the cedilla

console.log(text2.match(/\w/g)); // [ "g", "a", "r", "c", "o", "n" ] // matched "c" without the cedilla
console.log(text2.match(/./g)); // [ "g", "a", "r", "c", "̧", "o", "n" ] // matched "c" and the cedilla separately

Fortunately the combining marks have their own General_Category property with value Mark (alias M). We can use this to create an array with all the actual characters (graphemes), with composed characters (characters combined with a diacritic) taken as one character.


const text1 = "gar\u00E7on"; // "garçon" with "ç" as one code point U+00E7.
const text2 = "garc\u0327on"; // "garçon" with "ç" as a "c" and a combining diacritical cedilla mark U+0327.
const regex = /\P{M}\p{M}*/gu;

let textArray1 = text1.match(regex);
let textArray2 = text2.match(regex);

console.log(textArray1.length); // logs: 6
console.log(textArray1[3]); // logs: "ç"
console.log(textArray1); // logs: [ "g", "a", "r", "ç", "o", "n" ]

console.log(textArray2.length); // logs: 6
console.log(textArray2[3]); // logs: "ç"
console.log(textArray2); // logs: [ "g", "a", "r", "ç", "o", "n" ]

In the regex in the above example any character that is not a combining mark, combined with zero or more combining marks (\P{M}\p{M}*) is matched. The g flag is set, so it tries all characters, one by one.
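
One possible use of this regex is reversing a string without tearing a combining mark away from its base character. The function name reverseKeepingMarks is made up for this sketch. It only deals with combining marks, not with the emoji sequences discussed later.

```javascript
// Hypothetical helper: reverse a string while keeping each base character
// together with the combining marks that follow it.
function reverseKeepingMarks(input) {
  const graphemes = input.match(/\P{M}\p{M}*/gu) ?? [];
  return graphemes.reverse().join("");
}

const decomposed = "garc\u0327on"; // "c" followed by the combining cedilla U+0327

// Naive reversal by code point moves the cedilla onto the "o".
console.log([...decomposed].reverse().join("") === "no\u0327crag"); // logs: true

// The regex based reversal keeps the cedilla on the "c".
console.log(reverseKeepingMarks(decomposed) === "noc\u0327rag");    // logs: true
```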

In the examples above the "ç" is used as an example. In reality almost any user will insert the Unicode code point for the whole character, likely by using a shortcut keyboard combination (for the "ç": Alt+0231 in Windows or ⌥ Option+c on a Mac). Moreover, the used font may very well not support a correct rendering of character plus separate diacritical mark. Nevertheless, combining marks may be used in a text, maybe in complex scripts in which multiple combining diacritical marks may be added to one letter, hence the use of the asterisk (\P{M}\p{M}* instead of \P{M}\p{M}?).
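
If a text may contain both forms, it can help to normalize it before matching. This is a minimal sketch using the standard String.prototype.normalize() method. NFC composes characters where a precomposed code point exists, NFD decomposes them.

```javascript
const precomposed = "gar\u00E7on";     // "ç" as one code point
const decomposedForm = "garc\u0327on"; // "c" plus combining cedilla

console.log(precomposed === decomposedForm);                  // logs: false
console.log(precomposed === decomposedForm.normalize("NFC")); // logs: true
console.log(precomposed.normalize("NFD") === decomposedForm); // logs: true

// After NFC normalization \p{L} finds the "ç" as a single letter.
console.log(decomposedForm.normalize("NFC").match(/\p{L}/gu)); // logs: [ "g", "a", "r", "ç", "o", "n" ]
```

Not every base plus mark combination has a precomposed code point, so normalization reduces the problem but does not replace the \P{M}\p{M}* approach.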

Any character that is not a combining mark (\P{M}) means it also matches symbols, emoji, spaces, line terminators etc. This is appropriate for finding all single characters in a string. If you expect no combining diacritical marks in the text, just /\P{M}/gu might be sufficient for matching all single characters. However, emoji raise some issues of their own, and matching words needs a different approach. Both are covered later.

Ligatures

In some scripts letters can be formed by fusing two or more other letters. Such merged letters are called ligatures, like Æ or Œ. Also the letter "W" originated as two joined V's or U's and the ampersand "&" originated as a ligature of the letters et, Latin for "and".

Unicode has two code points to establish ligature connections: the Zero Width Joiner (U+200D, often abbreviated as ZWJ) and the Zero Width Non-Joiner (U+200C, often abbreviated as ZWNJ). They are the only code points with property \p{Join_Control}. Most ligatures in Latin languages have their own Unicode code points, like the Dž (U+01C5), the æ (U+00E6) or the œ (U+0153). They cannot be formed by doing something like a\u{200D}e to get the æ. Some scripts however, like the Arabic and Indic scripts, require numerous stylistic ligatures that do require ZWJ and ZWNJ characters in Unicode encoding. Constructing regular expressions to correctly match all letters, including ligatures, may be quite a challenge, but then again, I am not familiar with Arabic and Indic scripts and their encoding.

Words and word boundaries

Matching words in an ASCII Latin text could be done with a regex like /\w+/g where \w matches characters [A-Za-z0-9_]. To match words in a text in any script, one could replace \w with a character class including:
Letters (\p{L}),
decimal digits (\p{Decimal_Number} or \p{Nd}),
connector punctuation (\p{Connector_Punctuation} or \p{Pc}),
combining marks (\p{Mark} or \p{M}) and
ZWJ and ZWNJ (\p{Join_Control}).
The regex then becomes: /[\p{L}\p{Nd}\p{Pc}\p{M}\p{Join_Control}]+/gu.


const test = "äũéçöä ۲_8B এই বাংলা"
console.log(test.match(/[\p{L}\p{Nd}\p{Pc}\p{M}\p{Join_Control}]+/gu)); // logs: [ "äũéçöä", "۲_8B", "এই", "বাংলা" ]
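
To see why \p{Join_Control} is part of the character class, the sketch below uses a Persian word that is commonly written with a ZWNJ inside it (the word and the variable names are my own illustration). Without \p{Join_Control} the word falls apart into two matches.

```javascript
// "می‌خواهم": two letters, a ZWNJ (U+200C), then five more letters.
const persianWord = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645";

const withJoinControl = /[\p{L}\p{Nd}\p{Pc}\p{M}\p{Join_Control}]+/gu;
const withoutJoinControl = /[\p{L}\p{Nd}\p{Pc}\p{M}]+/gu;

console.log(persianWord.match(withJoinControl).length);    // logs: 1
console.log(persianWord.match(withoutJoinControl).length); // logs: 2 // split at the ZWNJ
```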

Matching a specific word, specific words or a specific letter combination at the start or end of words generally needs word boundaries. In a regular expression the assertion \b matches a word boundary (and \B matches a non-word boundary). Like other assertions (^, $ and lookarounds), word boundary \b does not consume any characters; it only determines whether the match can succeed at that position. Word boundary \b matches between ^ and \w, between \W and \w, between \w and \W and between \w and $.

But \w only matches ASCII word characters ([A-Za-z0-9_]), and \W matches [^A-Za-z0-9_]. For other or extended scripts, instead of \b, we can use the negative lookaround assertions (?<!\p{L}) for a word boundary before a character and (?!\p{L}) for a word boundary after a character. However, as mentioned before, be aware of possible limited browser support of the lookbehind assertions.

Typically words are separated by single white spaces. However, using character class \s does not match between ^ and a word, and not between a word and $. Moreover, words are not exclusively separated by white spaces; there may also be punctuation marks (commas, points, exclamation marks, question marks) between or at the end of words.

In the next example a Finnish text with non-ASCII characters is searched using word boundaries. It intends to search for all letters "ä" at the beginning of a word.


const str = "älkää olko mistään huolissanne";
console.log(str.match(/\sä/g)); // logs: null
console.log(str.match(/\bä/g)); // logs: [ "ä", "ä" ] // unintended match at index 3 and index 15.
console.log(str.match(/(?<!\p{L})ä/gu)); // logs: [ "ä" ] // intended match at index 0.

In the above example in the second regex, the second "ä" (index 3) and the fourth "ä" (index 15) are matched. The character "ä" is not an ASCII word character (\w). So \b does not match between the beginning of the string and the first "ä". It does match between "ä" and "l", but "l" does not match ä. Assertion \b finds a next match between "k" and "ä", whereupon ä matches "ä" at index 3. Because the g flag is set, the search continues and the next "ä" is found at index 15 in a similar fashion. The third regex actually does what was intended: searching for characters "ä" right after a word boundary. In this text it matches the first character "ä" (index 0).
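
The positions that each kind of boundary considers to be a word start can be listed directly. In this sketch matchIndices is a made-up helper; it relies on .matchAll() also accepting zero-width matches.

```javascript
const finnish = "älkää olko mistään huolissanne";

// Hypothetical helper: the indices at which a (zero-width) pattern matches.
function matchIndices(text, regex) {
  return Array.from(text.matchAll(regex), (match) => match.index);
}

// \b followed by an ASCII word character: "ä" is skipped or treated as a separator.
console.log(matchIndices(finnish, /\b(?=\w)/g));             // logs: [ 1, 6, 11, 17, 19 ]

// Not preceded by a letter, followed by a letter: the real word starts.
console.log(matchIndices(finnish, /(?<!\p{L})(?=\p{L})/gu)); // logs: [ 0, 6, 11, 19 ]
```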

Another example:


const str = "Tämä härkä on härkäpää.";

const matches1 = str.matchAll(/(?<!\p{L})härkä(?!\p{L})/gu);
for (let match of matches1) {
   console.log(`'${match[0]}' found at index ${match.index}.`); // logs: "'härkä' found at index 5."
}
const matches2 = str.matchAll(/\bhärkä\b/g);
for (let match of matches2) {
   console.log(`'${match[0]}' found at index ${match.index}.`); // logs: "'härkä' found at index 14."
}

Method .matchAll() and the for loop will be explained in section Methods using regular expressions.

The negative lookaround assertions in the examples above look for characters that are not letters. This means that if a mark, a number, or a connector punctuation character sits right after or right before the word to search for, the word boundary still matches. If this is not desirable, you can replace the \p{L} by [\p{L}\p{Nd}\p{Pc}\p{M}\p{Join_Control}] or a selection of these, depending on your application.


const text = "Tämä härkä on härkä_pää.";
let matches;

matches = text.matchAll(/(?<!\p{L})härkä(?!\p{L})/gu);
for (let match of matches) {
   console.log(`'${match[0]}' found at index ${match.index}.`);
}
// logs:
// "'härkä' found at index 5."
// "'härkä' found at index 14."

matches = text.matchAll(/(?<!\p{L})härkä(?![\p{L}\p{Pc}])/gu);
for (let match of matches) {
   console.log(`'${match[0]}' found at index ${match.index}.`);
}
// logs: "'härkä' found at index 5."
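
When the word to search for comes from a variable, the lookarounds can be wrapped in a small factory function. This is only a sketch: wholeWordRegex is a made-up name, the set of "word characters" in it is one possible choice, and the escaping step assumes the word may contain regex syntax characters.

```javascript
// Hypothetical helper: build a Unicode-aware "whole word" regex for any word.
function wholeWordRegex(word) {
  // Escape regex syntax characters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Assumed definition of a word character; adjust to your application.
  const wordChar = "[\\p{L}\\p{Nd}\\p{Pc}\\p{M}]";
  return new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "gu");
}

const sentence = "Tämä härkä on härkä_pää, härkä2 ja härkä.";
const found = Array.from(sentence.matchAll(wholeWordRegex("härkä")), (m) => m.index);
console.log(found); // logs: [ 5, 35 ] // "härkä_pää" and "härkä2" are skipped
```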

Emoji modifiers and Emoji sequences

General category \p{So} covers all emoji, pictograph and symbol code points, except (some basic) mathematical symbols, currency symbols and modifiers (see later). There is also a binary property \p{Emoji}, but this property also includes some characters that are generally not considered emoji (like digits 0-9) until they are "enhanced" to an "emoji appearance" (more about this later).

But emoji may be composed of multiple code points (apart from surrogate pairs). Emoji can be combined with modifiers and even multiple emoji can be combined to compose one new emoji. Such combinations are called emoji sequences. Let's explore some of the emoji sequences.

First the variation sequence: a symbol character followed by a variation selector to define the kind of presentation. Some symbols have a text presentation by default. This means by default they have a simple monochrome representation, like a letter. Many of these symbols can be "enhanced" to an emoji-like presentation. Then there are the symbols that have an emoji presentation by default. Many of these emoji can be "simplified" to a text presentation. The presentation is changed by directly adding variation selector \u{FE0F} (change to emoji style) or directly adding variation selector \u{FE0E} (change to text style) after the symbol.


console.log("\u{2708}, \u{2708}\u{FE0F}"); // logs: "✈, ✈️" // (text, emoji)
console.log("\u{231B}, \u{231B}\u{FE0E}"); // logs: "⌛, ⌛︎" // (emoji, text)

const text = "\u{2708}\u{FE0F}, \u{231B}\u{FE0E}";
console.log(text); // logs: "✈️, ⌛︎" // (emoji, text)
console.log(text.match(/\p{So}/gu)); // logs: [ "✈", "⌛" ] // (text, emoji)
console.log(text.match(/\p{Emoji_Presentation}/gu)); // logs: [ "⌛" ] // (emoji) // Characters that have emoji presentation by default.
console.log(text.match(/\p{Emoji}\uFE0F|\p{Emoji}\uFE0E/gu)); // logs: [ "✈️", "⌛︎" ] // (emoji, text)
console.log(text.match(/\p{Emoji}\uFE0F?/gu)); // logs: [ "✈️", "⌛" ] // (emoji, emoji)
console.log(text.match(/\p{Emoji}\uFE0E?/gu)); // logs: [ "✈", "⌛︎" ] // (text, text)

Notice: In the Firefox browser the emoji above are presented as one would expect. Be aware that in browsers emoji combined with a variation selector may not always be displayed as one may expect because of possible incomplete implementation of this presentation feature. The level of implementation varies from browser to browser. Each comment in the example above ends with the presentation as it should be in parentheses, e.g., "(text, emoji)" for first symbol presented in text-style, second symbol presented in emoji-style.

Characters with property \p{Emoji} also include the digits, the hash sign (#) and the asterisk. Why is this? Because these characters, too, can be "enhanced" to an emoji. They are the so-called keycap base characters; they can be combined with the combining enclosing keycap \u20E3.


console.log("#\u{FE0F}\u{20E3}, *\u{FE0F}\u{20E3}, 0\u{FE0F}\u{20E3}, 1\u{FE0F}\u{20E3}, 2\u{FE0F}\u{20E3} etc.");
// logs: #️⃣, *️⃣, 0️⃣, 1️⃣, 2️⃣ etc.

Unicode also defines a number of emoji, representing people or body parts, that can be modified to a number of different skin tones. The base emoji to be modified must be directly followed by the modifier code point (with property \p{Emoji_Modifier}). Hair colors and styles are not modifiers in this sense; they are expressed with zero width joiner sequences, which are covered further on.


const text = "Combining the 2 code points \u{1F467} + \u{1F3FF} results in \u{1F467}\u{1F3FF}.";
console.log(text); // logs: "Combining the 2 code points 👧 + 🏿 results in 👧🏿."
console.log(text.match(/\p{So}/gu)); // logs: [ "👧", "👧" ]
console.log(text.match(/\p{Emoji}/gu)); // logs: [ "2", "👧", "🏿", "👧", "🏿" ]
console.log(text.match(/\p{So}\p{Emoji_Modifier}?/gu)); // logs: [ "👧", "👧🏿" ]
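
The same properties can be used to remove skin tones again. In this sketch stripSkinTone is a made-up helper; it removes a modifier only when it directly follows a character with property \p{Emoji_Modifier_Base}, so a modifier standing on its own is left alone.

```javascript
// Hypothetical helper: drop skin tone modifiers that are attached to a base emoji.
function stripSkinTone(text) {
  return text.replace(/(?<=\p{Emoji_Modifier_Base})\p{Emoji_Modifier}/gu, "");
}

const toned = "\u{1F44B}\u{1F3FD} and \u{1F467}\u{1F3FF}"; // "👋🏽 and 👧🏿"

console.log(stripSkinTone(toned) === "\u{1F44B} and \u{1F467}"); // logs: true // "👋 and 👧"
console.log(stripSkinTone("\u{1F3FF}") === "\u{1F3FF}");         // logs: true // lone modifier untouched
```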

The binary character properties for emoji characters in a table:

Property | Alias | Description
Emoji | Emoji | Characters that could be considered or potentially could be considered emoji.
Emoji_Presentation | EPres | Characters that have emoji presentation by default. A subset of \p{Emoji}.
Emoji_Modifier_Base | EBase | Characters that can serve as a base for emoji modifiers. A subset of \p{Emoji}.
Emoji_Modifier | EMod | Emoji modifiers.
Emoji_Component | EComp | Code points in emoji sequences that separately would not be considered emoji, such as keycap base characters (#,*,0...9) or regional indicator symbols (poor browser support). Emoji_Component also includes the modifiers, a variation selector (only VARIATION SELECTOR-16 \u{FE0F}) and the zero width joiner (ZWJ, \u{200D}). All characters in emoji sequences are Emoji, both Emoji and Emoji_Component, or only Emoji_Component (like the ZWJ).
Extended_Pictographic | ExtPict | Characters that are used to future-proof segmentation. The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Component characters, which may make it an alternative for the general category property \p{So} as used in the regular expressions in this section.

Source: UTS #51: Unicode Emoji, table: Emoji Character Properties

In the next example a regular expression is used to find all emoji, pictographs and symbols as we have covered so far (except the keycaps).


const regex = /\p{So}[\u{FE0F}\u{FE0E}\p{Emoji_Modifier}]?/gu;

const text1 = "\u{1F467}, \u{1F467}\u{1F3FF}, \u{2708}, \u{2708}\u{FE0F}, \u{231B}, \u{231B}\u{FE0E}";
const text2 = `\
👧: Emoji (Emoji_Modifier_Base)
👧🏿: Emoji followed by a modifier (Emoji_Modifier_Base + Emoji_Modifier)
✈: Emoji 
✈️: Emoji followed by VARIATION SELECTOR-16 (U+FE0F)
⌛: Emoji (Emoji_Presentation)
⌛︎: Emoji followed by VARIATION SELECTOR-15 (U+FE0E) \
`;

console.log(text1.match(regex)); // logs: [ "👧", "👧🏿", "✈", "✈️", "⌛", "⌛︎" ]
console.log(text2.match(regex)); // logs: [ "👧", "👧🏿", "✈", "✈️", "⌛", "⌛︎" ]

The keycaps are not captured by the above regex. If you want to include them, you can use a regex like this: /(\p{So}[\u{FE0F}\u{FE0E}\p{Emoji_Modifier}]?)|(\p{Emoji}\u{FE0F}\u{20E3})/gu.
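
Applied to a mixed string (my own test string, written with escapes so the invisible selectors are visible), this extended regex behaves as follows:

```javascript
const emojiOrKeycap = /(\p{So}[\u{FE0F}\u{FE0E}\p{Emoji_Modifier}]?)|(\p{Emoji}\u{FE0F}\u{20E3})/gu;

// girl + dark skin tone, keycap 1, airplane + VS16, keycap #, and a plain digit 7
const mixed = "\u{1F467}\u{1F3FF} 1\u{FE0F}\u{20E3} \u{2708}\u{FE0F} #\u{FE0F}\u{20E3} 7";

const keycapMatches = mixed.match(emojiOrKeycap);
console.log(keycapMatches);        // logs: [ "👧🏿", "1️⃣", "✈️", "#️⃣" ]
console.log(keycapMatches.length); // logs: 4 // the plain "7" is not matched
```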

Then there are the more extended emoji sequences of multiple Emoji code points combined with Emoji_Component code points. An emoji sequence like this defines a new individual emoji. See this emoji list for the Unicode emoji characters and sequences and this emoji modifier sequences list for all emoji sequences with skin-tones and hair colors/styles.


console.log(`\u{1F469}, \u{1F3FF}, \u{2708} → \
\u{1F469}\u{1F3FF}\u{200D}\u{2708}\u{FE0F}`); // logs: "👩, 🏿, ✈ → 👩🏿‍✈️" // a female pilot with dark skin tone
// notice the zero width joiner (\u{200D}) and VARIATION SELECTOR-16 (\u{FE0F}).

console.log(`\u{1F468}\u{1F469}\u{1F467}\u{1F467} → \
\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F467}`); // logs: "👨👩👧👧 → 👨‍👩‍👧‍👧"

console.log(`\u{1F468}\u{1F469}\u{1F466}\u{1F466} → \
\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}`); // logs: "👨👩👦👦 → 👨‍👩‍👦‍👦"

const emoji = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
console.log(`The length of ${emoji} is ${emoji.length}.`); // logs: "The length of 👨‍👩‍👧‍👦 is 11."
console.log(emoji.match(/[\p{Emoji}\p{Emoji_Component}]/gu).length); // logs: 7
console.log(emoji.match(/\p{So}[\uFE0F\uFE0E\p{Emoji_Modifier}]?/gu).length); // logs: 4

Notice: Also these more extended emoji sequences may not always display as one would expect because of possible incomplete implementation in browsers. The level of implementation varies from browser to browser as well as the way the emoji are presented.

Matching these kinds of extended emoji sequences with one general regular expression, while non-sequence emoji need to be matched as well, is hard to get right with the property escapes covered here. These sequences do not follow one simple pattern, and emoji (sequences or not) can follow right after each other. So we'll just leave it at that...
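
For what it is worth, here is a rough and partial attempt. It is my own assumption-laden sketch, not a complete solution: it treats a pictograph with an optional modifier or VARIATION SELECTOR-16 as one unit and chains such units when a ZWJ sits between them. It does not cover keycaps, flags or tag sequences, and it also accepts ZWJ chains that no font renders as a single emoji.

```javascript
// One "unit": a pictograph, optionally followed by a skin tone modifier or VS16.
const unit = "\\p{Extended_Pictographic}(?:\\p{Emoji_Modifier}|\\u{FE0F})?";
// A unit, followed by any number of ZWJ + unit pairs.
const zwjAware = new RegExp(`${unit}(?:\\u{200D}${unit})*`, "gu");

const family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"; // 👨‍👩‍👧‍👦
const pilot = "\u{1F469}\u{1F3FF}\u{200D}\u{2708}\u{FE0F}";                    // 👩🏿‍✈️
const girl = "\u{1F467}";                                                      // 👧

// The three emoji follow right after each other, without separators.
const sequenceMatches = `${family}${pilot}${girl}`.match(zwjAware);
console.log(sequenceMatches.length);        // logs: 3
console.log(sequenceMatches[0] === family); // logs: true
console.log(sequenceMatches[1] === pilot);  // logs: true
```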

Internationalization

Figure 1 shows a table with some general regular expressions using Unicode escape sequences. These regular expressions match characters and words in texts that are not limited to the (ASCII) Latin script, but cover many other scripts, as well as emoji, pictographs and symbols. The table is composed of regular expressions that have been described in detail in the previous chapters. The example "reverse a string" in Appendix A shows an application of one of the regexes.

Matching characters and words is not the only difficulty; internationalization of regular expressions in general can really be a pain. For example, internationally gathered telephone numbers, addresses (house numbers, postal codes) or dates and times may be in a variety of formats. Even plain numbers may come in different formats, given the various forms of digit grouping and decimal marks used in the world.

Regular expression | Description
/[^\p{M}\u{FE0F}\u{FE0E}\p{Emoji_Modifier}\p{Join_Control}][\u{FE0F}\u{FE0E}\p{Emoji_Modifier}]?/gu | Matches all characters, incl. most emoji, pictographs and symbols, in a string. Shortcomings: It will not match combining diacritical marks and ligature sequences. It will not match extended emoji sequences of multiple Emoji/Emoji_Component code points as one combined character.
/[^\p{M}\u{FE0F}\u{FE0E}\p{Emoji_Modifier}\p{Join_Control}]/gu | Matches all single characters, incl. simple one code point emoji, pictographs and symbols, in a string. Shortcomings: It will not match combining diacritical marks and ligature sequences. It will not match Emoji sequences (no Emoji modifiers, variation selectors, keycaps, ZWJ, etc.) as one character.
/\p{So}[\u{FE0F}\u{FE0E}\p{Emoji_Modifier}]?/gu | Matches emoji, pictographs and symbols in a string. Shortcomings: It will not match extended emoji sequences of multiple Emoji/Emoji_Component code points as one combined character.
/[\p{L}\p{Nd}\p{Pc}\p{M}\p{Join_Control}]+/gu | Matches all words in a string. Shortcomings: It will not match symbols and emoji. It may match parts of emoji sequences like variation selectors, ZWJ and keycaps.
/(?<![\p{L}\p{Nd}\p{Pc}\p{M}])myWord(?![\p{L}\p{Nd}\p{Pc}\p{M}])/gu | Matches all words myWord in a string. Shortcomings: It will not match symbols and emoji. It may match parts of emoji sequences like variation selectors and keycaps. Possible limited browser support of the lookbehind assertions.
Fig.1 - Regular expressions for matching characters and words in a text. The characters are not limited to (ASCII) Latin script characters. They cover many other scripts, as well as emoji, pictographs and symbols. However, even these regular expressions have shortcomings and may not be suitable for some scripts. Specific scripts may need specific regular expressions. They are also not able to match the most complex emoji sequences correctly.