New flags

About flags

XRegExp provides two new flags, which can be combined with native flags and arranged in any order. Unlike native flags, non-native flags do not show up as properties on regular expression objects.

Dot matches all (s)

Usually, dot does not match newlines. However, a mode in which dot matches newlines can be as useful as one where dot doesn't. The s flag allows the mode to be selected on a per-regex basis. Escaped dots and dots within character classes (e.g. [.a-z]) are always equivalent to literal dots. The newline characters are \u000a (line feed), \u000d (carriage return), \u2028 (line separator), and \u2029 (paragraph separator).

The now-abandoned ECMAScript 4 proposals called for recognizing the C1/Unicode NEL "next line" control character (\u0085) as an additional newline character in that standard.
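
The dot behavior described in this section can be checked with a short sketch. It assumes an ES2018-capable engine and uses the native s flag as a stand-in for XRegExp's, so that it runs without the library; the variable names are illustrative only.

```javascript
// Native ES2018 `s` flag used as a stand-in (an assumption of this sketch);
// with XRegExp the equivalent construction would be XRegExp('a.b', 's').
const newlines = ['\u000a', '\u000d', '\u2028', '\u2029'];

// Without s, dot rejects every newline character; with s, it accepts them all.
const dotSkipsNewlines = newlines.every((ch) => !/^.$/.test(ch));
const dotAllMatchesNewlines = newlines.every((ch) => /^.$/s.test(ch));

// Escaped dots and dots inside character classes stay literal under s.
const escapedDotIsLiteral = !/a\.b/s.test('a\nb') && /a\.b/s.test('a.b');
const classDotIsLiteral = !/a[.]b/s.test('a\nb') && /a[.]b/s.test('a.b');

// NEL (\u0085) is not a newline in ECMAScript, so dot matches it even without s.
const dotMatchesNel = /^.$/.test('\u0085');

console.log({
  dotSkipsNewlines,
  dotAllMatchesNewlines,
  escapedDotIsLiteral,
  classDotIsLiteral,
  dotMatchesNel,
});
```

Each logged property should be true in a conforming engine; older browsers without native s support would need the XRegExp form instead.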

Annotations

Free-spacing and line comments (x)

This flag has two complementary effects. First, it causes most whitespace to be ignored, so you can free-format the regex pattern for readability. Second, it allows comments with a leading #. Specifically, it turns most whitespace into an "ignore me" metacharacter, and # into an "ignore me, and everything else up to the next newline" metacharacter. They aren't taken as metacharacters within character classes (which means that classes are not free-format, even with x), and as with other metacharacters, you can escape whitespace and # that you want to be taken literally. Of course, you can always use \s to match whitespace.

It might be better to think of whitespace and comments as do-nothing (rather than ignore-me) metacharacters. This distinction is important with something like \12 3, which with the x flag is taken as \12 followed by 3, and not \123. However, quantifiers following whitespace or comments apply to the preceding token, so x + is equivalent to x+.
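
To make the free-spacing rules concrete, here is a minimal sketch of a hypothetical helper, freeSpacing, that rewrites a pattern into a native RegExp. It is not XRegExp's implementation, only one way to emulate the behavior described in this section: whitespace and # comments outside character classes become a do-nothing empty group, except before a quantifier or at the edges of the pattern.

```javascript
// Hypothetical helper for illustration; not part of XRegExp.
function freeSpacing(source, flags = '') {
  let out = '';
  let inClass = false;
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (ch === '\\') {
      // Escaped character (including escaped whitespace or #): copy verbatim.
      out += source.slice(i, i + 2);
      i += 2;
    } else if (inClass) {
      // Character classes are not free-format: keep everything as written.
      if (ch === ']') inClass = false;
      out += ch;
      i += 1;
    } else if (ch === '[') {
      inClass = true;
      out += ch;
      i += 1;
    } else if (/\s/.test(ch) || ch === '#') {
      // Skip a whole run of whitespace and comments (a comment ends at \n).
      while (i < source.length) {
        if (/\s/.test(source[i])) {
          i += 1;
        } else if (source[i] === '#') {
          while (i < source.length && source[i] !== '\n') i += 1;
        } else {
          break;
        }
      }
      const next = source[i];
      const atEdge = out === '' || next === undefined;
      // Emit a do-nothing group so adjacent tokens stay separate,
      // unless a quantifier follows (it then applies to the preceding token).
      if (!atEdge && !'*+?{'.includes(next)) out += '(?:)';
    } else {
      out += ch;
      i += 1;
    }
  }
  return new RegExp(out, flags);
}

const yearMonth = freeSpacing(String.raw`
  (\d{4})  # year
  -
  (\d{2})  # month
`);
const match = yearMonth.exec('2024-05');
console.log(match[1], match[2]);
console.log(freeSpacing('x +').source);
```

The sketch only treats \n as a comment terminator and does no validation, so it should be read as an illustration of the rules rather than a replacement for the x flag.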

ECMA-262 Edition 3 uses an interpretation of whitespace based on Unicode's Basic Multilingual Plane, from version 2.1 or later of the Unicode standard. Following are the characters that should be matched by \s according to ECMA-262 Edition 3 and Unicode 5.1:

JavaScript's \s is similar but not equivalent to \p{Z} from regex libraries that support Unicode properties, including XRegExp's own Unicode plugin. The difference is that \s includes the characters \u0009 through \u000d, which are not assigned the Separator property in the Unicode character database.
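
The difference can be observed directly. The sketch below assumes an engine with native Unicode property escapes (ES2018, u flag) and uses them as a stand-in for XRegExp's Unicode plugin; the variable names are illustrative.

```javascript
// Control characters \u0009 through \u000d: whitespace, but not Separators.
const controls = ['\u0009', '\u000a', '\u000b', '\u000c', '\u000d'];

const report = controls.map((ch) => ({
  codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
  matchesWhitespace: /\s/.test(ch),
  // Native \p{Z} requires the u flag; this stands in for the plugin's \p{Z}.
  matchesSeparator: /\p{Z}/u.test(ch),
}));

// For contrast, the no-break space is matched by both tokens.
const nbspMatchesBoth = /\s/.test('\u00a0') && /\p{Z}/u.test('\u00a0');

console.table(report);
console.log({ nbspMatchesBoth });
```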

Note that not all shorthand character classes and other JavaScript regex syntax are Unicode-aware. According to ECMA-262 Edition 3, \s, \S, ., ^, and $ use Unicode-based interpretations of whitespace and newline, while \d, \D, \w, \W, \b, and \B use ASCII-only interpretations of digit, word character, and word boundary (e.g. /a\b/.test("naïve") returns true). Actual browser implementations differ on these points. For example, Firefox 2 and lower consider \d and \D to be Unicode-aware, whereas Firefox 3 fixes this bug, making \d equivalent to [0-9] as in most other browsers.
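
A quick way to see the split between Unicode-based and ASCII-only tokens is to probe a few representative characters. This is a small sketch with hand-picked test characters, and it assumes a current engine such as Node.js that follows the specification; as noted, older browsers may report different results.

```javascript
const checks = {
  // \b is ASCII-only: ï is not a word character, so a boundary follows the "a".
  boundaryInsideNaive: /a\b/.test('na\u00efve'),
  // \w is ASCII-only: it does not match ï (U+00EF).
  wordMatchesIDiaeresis: /\w/.test('\u00ef'),
  // \d is [0-9]: it does not match ARABIC-INDIC DIGIT THREE (U+0663).
  digitMatchesArabicIndic: /\d/.test('\u0663'),
  // \s is Unicode-based: it matches NO-BREAK SPACE (U+00A0).
  whitespaceMatchesNbsp: /\s/.test('\u00a0'),
  // ^ and $ (with the m flag) treat LINE SEPARATOR (U+2028) as a newline.
  anchorsSplitOnLineSeparator: /^b$/m.test('a\u2028b'),
};

console.log(checks);
```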

To test which characters or positions are matched by the tokens just mentioned in your browser, see the JavaScript regex Unicode compatibility test.

Annotations

References and sources