Esprima tokenizer takes a string as input and produces an array of tokens, a list of objects representing categorized input characters. This is known as lexical analysis.
The interface of the tokenize function is as follows:
esprima.tokenize(input, config)
where
  input is a string representing the program to be tokenized
  config is an object used to customize the tokenization behavior (optional)
The input argument is mandatory. Its type must be a string; otherwise, the tokenization behavior is undefined.
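For callers that cannot guarantee a string input, a thin wrapper can validate the argument before handing it to the tokenizer. This is a minimal sketch; the wrapper name tokenizeStrict is a hypothetical choice, not part of the Esprima API:

const esprima = require('esprima');

// Hypothetical helper: reject non-string input explicitly instead of
// relying on undefined tokenizer behavior.
function tokenizeStrict(input, config) {
    if (typeof input !== 'string') {
        throw new TypeError('input must be a string');
    }
    return esprima.tokenize(input, config);
}

console.log(tokenizeStrict('answer = 42'));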
The various properties of config are summarized in the following table:

  Name      Type      Default   Description
  comment   Boolean   false     Include every line and block comment in the output
  loc       Boolean   false     Attach line and column location to each token
  range     Boolean   false     Attach zero-based start and end indices to each token
An example Node.js REPL session that demonstrates the use of Esprima tokenizer is:
$ node
> var esprima = require('esprima')
> esprima.tokenize('answer = 42')
[ { type: 'Identifier', value: 'answer' },
  { type: 'Punctuator', value: '=' },
  { type: 'Numeric', value: '42' } ]
In the above example, the input string is tokenized into 3 tokens: an identifier, a punctuator, and a number. For each token, the type property is a string indicating the type of the token and the value property stores the corresponding lexeme, i.e. a string of characters that forms a syntactic unit.
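Since every token carries these two properties, simple analyses can be built directly on the token array. A short sketch that tallies how often each token type appears:

const esprima = require('esprima');

// Count the occurrences of each token type using only the documented
// type property of every token.
const tokens = esprima.tokenize('answer = answer + 42');
const counts = {};
for (const { type } of tokens) {
    counts[type] = (counts[type] || 0) + 1;
}
console.log(counts);  // { Identifier: 2, Punctuator: 2, Numeric: 1 }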
Unlike the parse function, the tokenize function can work with an input string that does not represent a valid JavaScript program. This is because lexical analysis, as the name implies, does not involve the process of understanding the syntactic structure of the input.
$ node
> var esprima = require('esprima')
> esprima.tokenize('42 = answer')
[ { type: 'Numeric', value: '42' },
  { type: 'Punctuator', value: '=' },
  { type: 'Identifier', value: 'answer' } ]
> esprima.tokenize('while (if {}')
[ { type: 'Keyword', value: 'while' },
  { type: 'Punctuator', value: '(' },
  { type: 'Keyword', value: 'if' },
  { type: 'Punctuator', value: '{' },
  { type: 'Punctuator', value: '}' } ]
By default, each token in the array returned by the tokenizer has only two properties, the type of the token and the lexeme. For some use cases, the location of each token needs to be known as well (e.g. to offer meaningful feedback to the user). Esprima tokenizer can add that location information to each token in two forms: a zero-based range and a line-column location. This is done by customizing the tokenization process with the configuration object.
Setting range (in the configuration object) to true adds a new property, range, to each token. It is an array of two elements: the zero-based index of the token's starting position and that of its end position (exclusive). A simple example follows:
$ node
> var esprima = require('esprima')
> esprima.tokenize('answer = 42', { range: true })
[ { type: 'Identifier', value: 'answer', range: [ 0, 6 ] },
  { type: 'Punctuator', value: '=', range: [ 7, 8 ] },
  { type: 'Numeric', value: '42', range: [ 9, 11 ] } ]
In the above example, the starting and end location of each token can be determined from its range property. For instance, the equal sign (=) starts at zero-based index 7 in the input string (i.e. it is the 8th character), because its range is [7, 8].
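Because the end index is exclusive, a token's range pairs directly with String.prototype.slice. A quick sketch to confirm that slicing the input with a token's range reproduces its lexeme:

const esprima = require('esprima');

const source = 'answer = 42';
// For every token, slicing the input with its range should reproduce
// the lexeme exactly, since the end index is exclusive.
for (const token of esprima.tokenize(source, { range: true })) {
    const [start, end] = token.range;
    console.log(source.slice(start, end) === token.value);  // true, true, true
}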
Setting loc to true adds a new property, loc, to each token. It is an object that contains the line number and column number of the starting and end location (exclusive) of the token. This is illustrated in the example:
$ node
> var esprima = require('esprima')
> tokens = esprima.tokenize('answer = 42', { loc: true });
> tokens[2]
{ type: 'Numeric',
  value: '42',
  loc: { start: { line: 1, column: 9 }, end: { line: 1, column: 11 } } }
Note that the line number is one-based while the column number is zero-based.
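This mixed convention matters when reporting positions to users, who usually expect one-based columns. A minimal sketch that adjusts for it:

const esprima = require('esprima');

// Print each token with a conventional one-based line and column,
// compensating for the zero-based column reported in loc.
for (const token of esprima.tokenize('answer = 42', { loc: true })) {
    const { line, column } = token.loc.start;
    console.log(`${line}:${column + 1}  ${token.value}`);
}
// 1:1  answer
// 1:8  =
// 1:10  42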
It is possible to set both range and loc to true, thereby giving each token the most complete location information.
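For example, with both options enabled, each token carries both forms at once:

const esprima = require('esprima');

// The first token of 'answer = 42' carries both range and loc.
const [first] = esprima.tokenize('answer = 42', { range: true, loc: true });
console.log(first.range);  // [ 0, 6 ]
console.log(first.loc);    // { start: { line: 1, column: 0 }, end: { line: 1, column: 6 } }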
By default, Esprima tokenizer ignores every line and block comment. If each comment needs to be included in the output, then the property comment in the configuration object needs to be set to true. To illustrate this, compare the following simple tokenization:
$ node
> var esprima = require('esprima')
> esprima.tokenize('/* answer */ 42')
[ { type: 'Numeric', value: '42' } ]
with the following situation where the token array also contains the block comment:
$ node
> var esprima = require('esprima')
> esprima.tokenize('/* answer */ 42', { comment: true })
[ { type: 'BlockComment', value: ' answer ' },
  { type: 'Numeric', value: '42' } ]
If the location of each comment is needed, enable the location information using range and/or loc (as explained in the previous section):
$ node
> var esprima = require('esprima')
> esprima.tokenize('/* answer */ 42', { comment: true, range: true })
[ { type: 'BlockComment', value: ' answer ', range: [ 0, 12 ] },
  { type: 'Numeric', value: '42', range: [ 13, 15 ] } ]
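Combining comment and loc enables simple comment-scanning tools. A minimal sketch that lists TODO comments, assuming line comments are reported with the type LineComment:

const esprima = require('esprima');

const source = '// TODO: simplify\nconst x = 1; /* TODO: remove */';
const tokens = esprima.tokenize(source, { comment: true, loc: true });

// Keep only comment tokens whose text mentions TODO, then report
// each one with its line number.
tokens
    .filter(t => (t.type === 'LineComment' || t.type === 'BlockComment') &&
                 t.value.includes('TODO'))
    .forEach(t => console.log(`line ${t.loc.start.line}: ${t.value.trim()}`));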
Since a tokenization process does not have the context of the syntactic structure, it cannot properly infer that a particular reserved word is being used as something other than a keyword. Therefore, it always classifies a reserved word as a keyword. A simple example illustrates this limitation:
$ node
> var esprima = require('esprima')
> esprima.tokenize('x.if = 1')
[ { type: 'Identifier', value: 'x' },
  { type: 'Punctuator', value: '.' },
  { type: 'Keyword', value: 'if' },
  { type: 'Punctuator', value: '=' },
  { type: 'Numeric', value: '1' } ]
In the above session, the type of the if token is Keyword.
This is, however, different from what will be obtained using the Esprima parser, since the parser correctly recognizes the if token as an object property and therefore constructs an associated Identifier node, not a Keyword node.
$ node
> var esprima = require('esprima')
> esprima.parseScript('x.if = 1').body[0].expression.left.property
Identifier { type: 'Identifier', name: 'if' }
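If a tokenizer-only pipeline must approximate the parser's behavior, one possible workaround is a naive post-processing pass that reclassifies a reserved word immediately following a period. This is an illustrative heuristic only; it covers member expressions but would still misclassify other cases, such as reserved words used as property keys in object literals:

const esprima = require('esprima');

// Illustrative heuristic: a Keyword token directly after a '.' punctuator
// is being used as a property name, so relabel it as an Identifier.
function reclassify(tokens) {
    return tokens.map((token, i) => {
        const prev = tokens[i - 1];
        if (token.type === 'Keyword' && prev &&
            prev.type === 'Punctuator' && prev.value === '.') {
            return { ...token, type: 'Identifier' };
        }
        return token;
    });
}

console.log(reclassify(esprima.tokenize('x.if = 1')));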
JSX is a syntax extension to JavaScript, popularized by React for building web applications. JSX is not part of any official ECMAScript specification.
Esprima tokenizer is unable to process input source that contains a mix of JavaScript code and JSX. This is because switching to JSX mode requires an understanding of the context, which a tokenizer does not have. In particular, a closing JSX element (such as </title>) confuses the tokenizer, since the forward slash (/) is identified as the start of a regular expression.
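A quick way to observe this limitation; the exact error message may vary between Esprima versions:

const esprima = require('esprima');

try {
    esprima.tokenize('const el = <title>Answer</title>;');
} catch (e) {
    // The tokenizer treats the '/' in the closing element as the start of
    // a regular expression and fails before reaching the end of the input.
    console.log('Tokenization failed:', e.message);
}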
The following Node.js script demonstrates the use of Esprima tokenizer to apply syntax highlighting to a JavaScript code fragment. It accepts the input from stdin and produces a color-coded version on stdout using ANSI escape codes.
const esprima = require('esprima');
const readline = require('readline');

const CYAN = '\x1b[36m';
const RESET = '\x1b[0m';
let source = '';

readline.createInterface({ input: process.stdin, terminal: false })
    .on('line', line => { source += line + '\n'; })
    .on('close', () => {
        const tokens = esprima.tokenize(source, { range: true });
        const ids = tokens.filter(x => x.type === 'Identifier');
        // Process the tokens from last to first so that the ranges of
        // earlier tokens remain valid as the source string grows.
        const markers = ids.sort((a, b) => b.range[0] - a.range[0]);
        markers.forEach(t => {
            const id = CYAN + t.value + RESET;
            const start = t.range[0];
            const end = t.range[1];
            source = source.slice(0, start) + id + source.slice(end);
        });
        console.log(source);
    });
An example run is shown in the following screenshot (the script is called highlight.js):
The script uses the readline module to read the input line by line, collecting each line into a local string buffer. Once there is no more input, it invokes Esprima tokenizer to break the source into a list of tokens. The script only cares about identifier tokens, hence the filtering. For each token, the starting location is used to determine where to insert the escape code that changes the color to cyan, and the end location is used to reset the color. This is done from the last identifier token to the first, which necessitates sorting in reverse order.
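The same marking technique extends to other token types. A self-contained sketch that also colors keywords; the color assignments here are arbitrary choices, not part of the original script:

const esprima = require('esprima');

// Map token types to ANSI colors: cyan for identifiers, magenta for
// keywords (illustrative choices).
const COLORS = { Identifier: '\x1b[36m', Keyword: '\x1b[35m' };
const RESET = '\x1b[0m';

function highlight(source) {
    esprima.tokenize(source, { range: true })
        .filter(t => COLORS[t.type])
        .sort((a, b) => b.range[0] - a.range[0])  // last token first
        .forEach(t => {
            source = source.slice(0, t.range[0]) +
                     COLORS[t.type] + t.value + RESET +
                     source.slice(t.range[1]);
        });
    return source;
}

console.log(highlight('if (answer) { console.log(answer); }'));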
For a real-world syntax highlighter that has many more features, take a look at cardinal (source repository: github.com/thlorenz/cardinal). It uses a similar approach, i.e. using Esprima tokenizer to break the source into tokens and then wrapping each token with a type-specific color.