| Overview [](https://travis-ci.org/lydell/js-tokens) |
| ======== |
| |
| A regex that tokenizes JavaScript. |
| |
| ```js |
| var jsTokens = require("js-tokens").default |
| |
| var jsString = "var foo=opts.foo;\n..." |
| |
| jsString.match(jsTokens) |
| // ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...] |
| ``` |
| |
| |
| Installation |
| ============ |
| |
| `npm install js-tokens` |
| |
| ```js |
| import jsTokens from "js-tokens" |
| // or: |
| var jsTokens = require("js-tokens").default |
| ``` |
| |
| |
| Usage |
| ===== |
| |
| ### `jsTokens` ### |
| |
| A regex with the `g` flag that matches JavaScript tokens. |
| |
| The regex _always_ matches, even invalid JavaScript and the empty string. |
| |
| The next match is always directly after the previous. |
| |
| ### `var token = matchToToken(match)` ### |
| |
| ```js |
| import {matchToToken} from "js-tokens" |
| // or: |
| var matchToToken = require("js-tokens").matchToToken |
| ``` |
| |
| Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type: |
| String, value: String}` object. The following types are available: |
| |
| - string |
| - comment |
| - regex |
| - number |
| - name |
| - punctuator |
| - whitespace |
| - invalid |
| |
| Multi-line comments and strings also have a `closed` property indicating if the |
| token was closed or not (see below). |
| |
| Comments and strings both come in several flavors. To distinguish them, check if |
| the token starts with `//`, `/*`, `'`, `"` or `` ` ``. |
| |
| Names are ECMAScript IdentifierNames, that is, including both identifiers and |
| keywords. You may use [is-keyword-js] to tell them apart. |
| |
| Whitespace includes both line terminators and other whitespace. |
| |
| [is-keyword-js]: https://github.com/crissdev/is-keyword-js |
| |
| |
| ECMAScript support |
| ================== |
| |
| The intention is to always support the latest ECMAScript version whose feature |
| set has been finalized. |
| |
| If adding support for a newer version requires changes, a new version with a |
| major verion bump will be released. |
| |
| Currently, ECMAScript 2018 is supported. |
| |
| |
| Invalid code handling |
| ===================== |
| |
| Unterminated strings are still matched as strings. JavaScript strings cannot |
| contain (unescaped) newlines, so unterminated strings simply end at the end of |
| the line. Unterminated template strings can contain unescaped newlines, though, |
| so they go on to the end of input. |
| |
| Unterminated multi-line comments are also still matched as comments. They |
| simply go on to the end of the input. |
| |
| Unterminated regex literals are likely matched as division and whatever is |
| inside the regex. |
| |
| Invalid ASCII characters have their own capturing group. |
| |
| Invalid non-ASCII characters are treated as names, to simplify the matching of |
| names (except unicode spaces which are treated as whitespace). Note: See also |
| the [ES2018](#es2018) section. |
| |
| Regex literals may contain invalid regex syntax. They are still matched as |
| regex literals. They may also contain repeated regex flags, to keep the regex |
| simple. |
| |
| Strings may contain invalid escape sequences. |
| |
| |
| Limitations |
| =========== |
| |
| Tokenizing JavaScript using regexes—in fact, _one single regex_—won’t be |
| perfect. But that’s not the point either. |
| |
| You may compare jsTokens with [esprima] by using `esprima-compare.js`. |
| See `npm run esprima-compare`! |
| |
| [esprima]: http://esprima.org/ |
| |
| ### Template string interpolation ### |
| |
| Template strings are matched as single tokens, from the starting `` ` `` to the |
| ending `` ` ``, including interpolations (whose tokens are not matched |
| individually). |
| |
| Matching template string interpolations requires recursive balancing of `{` and |
| `}`—something that JavaScript regexes cannot do. Only one level of nesting is |
| supported. |
| |
| ### Division and regex literals collision ### |
| |
| Consider this example: |
| |
| ```js |
| var g = 9.82 |
| var number = bar / 2/g |
| |
| var regex = / 2/g |
| ``` |
| |
| A human can easily understand that in the `number` line we’re dealing with |
| division, and in the `regex` line we’re dealing with a regex literal. How come? |
| Because humans can look at the whole code to put the `/` characters in context. |
| A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also |
| look backwards. See the [ES2018](#es2018) section). |
| |
| When the `jsTokens` regex scans throught the above, it will see the following |
| at the end of both the `number` and `regex` rows: |
| |
| ```js |
| / 2/g |
| ``` |
| |
| It is then impossible to know if that is a regex literal, or part of an |
| expression dealing with division. |
| |
| Here is a similar case: |
| |
| ```js |
| foo /= 2/g |
| foo(/= 2/g) |
| ``` |
| |
| The first line divides the `foo` variable with `2/g`. The second line calls the |
| `foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only |
| sees forwards, it cannot tell the two cases apart. |
| |
| There are some cases where we _can_ tell division and regex literals apart, |
| though. |
| |
| First off, we have the simple cases where there’s only one slash in the line: |
| |
| ```js |
| var foo = 2/g |
| foo /= 2 |
| ``` |
| |
| Regex literals cannot contain newlines, so the above cases are correctly |
| identified as division. Things are only problematic when there are more than |
| one non-comment slash in a single line. |
| |
| Secondly, not every character is a valid regex flag. |
| |
| ```js |
| var number = bar / 2/e |
| ``` |
| |
| The above example is also correctly identified as division, because `e` is not a |
| valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*` |
| (any letter) as flags, but it is not worth it since it increases the amount of |
| ambigous cases. So only the standard `g`, `m`, `i`, `y` and `u` flags are |
| allowed. This means that the above example will be identified as division as |
| long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6 |
| characters long. |
| |
| Lastly, we can look _forward_ for information. |
| |
| - If the token following what looks like a regex literal is not valid after a |
| regex literal, but is valid in a division expression, then the regex literal |
| is treated as division instead. For example, a flagless regex cannot be |
| followed by a string, number or name, but all of those three can be the |
| denominator of a division. |
| - Generally, if what looks like a regex literal is followed by an operator, the |
| regex literal is treated as division instead. This is because regexes are |
| seldomly used with operators (such as `+`, `*`, `&&` and `==`), but division |
| could likely be part of such an expression. |
| |
| Please consult the regex source and the test cases for precise information on |
| when regex or division is matched (should you need to know). In short, you |
| could sum it up as: |
| |
| If the end of a statement looks like a regex literal (even if it isn’t), it |
| will be treated as one. Otherwise it should work as expected (if you write sane |
| code). |
| |
| ### ES2018 ### |
| |
| ES2018 added some nice regex improvements to the language. |
| |
| - [Unicode property escapes] should allow telling names and invalid non-ASCII |
| characters apart without blowing up the regex size. |
| - [Lookbehind assertions] should allow matching telling division and regex |
| literals apart in more cases. |
| - [Named capture groups] might simplify some things. |
| |
| These things would be nice to do, but are not critical. They probably have to |
| wait until the oldest maintained Node.js LTS release supports those features. |
| |
| [Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html |
| [Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html |
| [Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html |
| |
| |
| License |
| ======= |
| |
| [MIT](LICENSE). |