You are reading a translation of an old blog post published on my previous blog in French.
Many thanks to Lea Verou, et al., for Prism.js!
— Brendan Eich, creator of JavaScript
Many JavaScript libraries have supported syntax highlighting for a long time. A recent newcomer has become popular and is already used by major websites like Mozilla. This library is Prism, created by Lea Verou, author of the library -prefix-free and of the code playground Dabblet. Using Prism, our code is even more beautiful. What is most surprising about Prism is the size of the codebase: only 400 lines of JavaScript (Google Prettify and SyntaxHighlighter count more than 2000 lines).
How does Prism achieve this tour de force? We’ll find out by rewriting Prism from scratch.
Prism is published under the MIT license. The code presented in this article has been simplified for obvious reasons and must not be used outside this learning context. This article is based on the latest version of Prism at the time of writing.
Let’s Go!
The documentation to extend Prism gives us interesting details about the inner workings of the library. Let’s start by outlining the global structure:
1. This is the entry point. We look for all source code present in the document.
2. We replace the previous content with the stylized one.
3. We tokenize the source code to colorize it using CSS classes.
The code is relatively simple to apprehend. We search for `<code>` tags having a CSS class starting with `language-`. We extract their content to split it into tokens, as a compiler or interpreter would do. The main difference is that a compiler goes much further, creating an abstract syntax tree as an intermediate representation before generating code in the target language. Here, we don’t even try to ensure the code is valid. We are just looking for tokens to surround them with new tags carrying CSS classes.
The Lexical Analyzer
The function `tokenize` accepts two parameters:

- `text`: the content of a `<code>` tag.
- `grammar`: the definition of the programming language, often defined in different files.
Let’s take an example to illustrate the working of this function. We define a subset of the Java language named `javalite`:
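The original definition is not reproduced here, so here is a minimal sketch of what such a grammar and sample program might look like (the exact regular expressions are ours, much simpler than Prism's real Java grammar):

```javascript
// Stand-in for the Prism global, so the sketch is self-contained.
const Prism = { languages: {} };

// A hypothetical "javalite" grammar: three tokens, each defined by a regex.
Prism.languages.javalite = {
  'string': /"(?:\\.|[^"\\])*"/g,
  'keyword': /\b(public|static|class|void)\b/g,
  'punctuation': /[{}[\];(),.]/g
};

// The sample HelloWorld program to highlight.
const text = [
  'public class HelloWorld {',
  '    public static void main(String[] args) {',
  '        System.out.println("Hello, World!");',
  '    }',
  '}'
].join('\n');

const grammar = Prism.languages.javalite;
```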
The variable `text` contains the code of our program `HelloWorld`, and `grammar` contains the object `Prism.languages.javalite`.
Note that the definition of this `javalite` language is basic. Prism supports more options to address more exotic rules, which will be discussed later in this article. Our definition consists only of three tokens, each with a regular expression to find them.
What is the difference between a token and a lexeme? The Dragon Book brings us the answer: “A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.”
To illustrate this difference using the previous example, our definition of the `javalite` language uses three tokens (ex: `keyword`). The strings `"public"` or `"static"` are examples of lexemes of the same token `keyword`.
This distinction is not followed in the source code of Prism, where lexemes and tokens are both named using the same term token.
Here is the result returned by the function `tokenize`:
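The exact value depends on the grammar, but for our HelloWorld sample the result has roughly this shape (a sketch; the real objects are instances of a small Token wrapper described next):

```javascript
// Approximate shape of the array returned by tokenize: plain strings for
// unmatched text, and objects pairing a token name with the lexeme text.
const result = [
  { type: 'keyword', content: 'public' },
  ' ',
  { type: 'keyword', content: 'class' },
  ' HelloWorld ',
  { type: 'punctuation', content: '{' },
  // ... the remaining lexemes of the program ...
  { type: 'punctuation', content: '}' }
];
```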
We retrieve our code sample divided into lexemes. For each lexeme having an associated token (`string`, `punctuation`, `keyword`), an object `Token` is created containing the text of the lexeme and the name of the token:
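A sketch of such a constructor (simplified; Prism's real Token carries a little more information, such as aliases):

```javascript
// Simplified Token: just the token name and the lexeme text.
function Token(type, content) {
  this.type = type;       // the token name, e.g. "keyword"
  this.content = content; // the lexeme, e.g. "public"
}

const token = new Token('keyword', 'public');
```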
Confused? Don’t worry. We will come back to the lexical analyzer in the last part.
Syntax Highlighting
Once the list of lexemes is identified, colorizing the code is trivial. It’s the job of the method `Token.stringify`:
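A simplified sketch of this method (Prism's real stringify also handles language information and hooks; this version keeps only the core recursion):

```javascript
// Simplified Token (assumed from the previous section).
function Token(type, content) {
  this.type = type;
  this.content = content;
}

// Turn the list of lexemes back into HTML.
Token.stringify = function (o) {
  // Lexemes without a token are plain strings: keep them as-is.
  if (typeof o === 'string') {
    return o;
  }
  // The initial call receives the complete list of lexemes.
  if (Array.isArray(o)) {
    return o.map(Token.stringify).join('');
  }
  // Otherwise, wrap the token content in a decorated <span>.
  return '<span class="token ' + o.type + '">'
       + Token.stringify(o.content)
       + '</span>';
};
```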
This recursive method is called initially with the complete list of lexemes. For every lexeme without a token found, the original value is preserved. For other lexemes, we decorate the value using a new `<span>` tag having the CSS classes `token` and the token name (`keyword`, `punctuation`, `string`, …).
Then, we have to define a few CSS declarations. (The `<pre>` tag is important to preserve the spacing and newlines.)
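For instance, a minimal theme could look like this (the colors are ours; Prism ships its own official themes):

```css
/* The <pre> tag preserves spacing and newlines. */
pre {
  overflow: auto;
  background: #f5f2f0;
  padding: 1em;
}

/* One rule per token name used by our javalite grammar. */
.token.keyword     { color: #0077aa; font-weight: bold; }
.token.string      { color: #669900; }
.token.punctuation { color: #999999; }
```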
Here is what our code looks like when these styles are applied:
The last missing piece from our puzzle is still the lexical analyzer.
The Lexical Analyzer (Again)
Let’s get started with a first version supporting the previous basic grammar:
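Here is a simplified sketch of such a first version (not Prism's exact code; the Token constructor is repeated so the sketch is self-contained):

```javascript
function Token(type, content) {
  this.type = type;
  this.content = content;
}

function tokenize(text, grammar) {
  // The list starts with a single string: the complete source code.
  const strarr = [text];

  // Process each token of the grammar in turn.
  for (const tokenName in grammar) {
    const pattern = grammar[tokenName];

    for (let i = 0; i < strarr.length; i++) {
      const str = strarr[i];

      // Skip lexemes already turned into tokens by a previous pass.
      if (str instanceof Token) continue;

      pattern.lastIndex = 0;
      const match = pattern.exec(str);
      if (!match) continue;

      const from = match.index;
      const matched = match[0];
      const before = str.slice(0, from);
      const after = str.slice(from + matched.length);

      // Replace the string with: [before, lexeme, after].
      const replacement = [new Token(tokenName, matched)];
      if (before) replacement.unshift(before);
      if (after) replacement.push(after);
      strarr.splice(i, 1, ...replacement);
    }
  }
  return strarr;
}
```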
At first, the function may seem obscure, but the logic is simpler than it looks. For every token defined in the language grammar, we iterate over the input list, which initially contains a single string with the complete source code; after several iterations, this string will have been split into lexemes.
Let’s unwind the algorithm on our example, considering only the token `keyword` defined by the regular expression `/\b(public|static|class|void)\b/g`:
Does `'public class HelloWorld { ... }'` match the regular expression? Yes.

We replace this element with three new elements:

- The string before the match: the string is empty. We have nothing to add.
- The found lexeme: `public`.
- The string after the match: `' class HelloWorld { ... }'`.
Does `' class HelloWorld { ... }'` match the regular expression? Yes.

Similarly, we replace the element with three new elements:

- The string before the match: the space character.
- The found lexeme: `class`.
- The string after the match: `' HelloWorld { ... }'`.
The element is already a processed token. We continue.
Does `' HelloWorld { ... }'` match the regular expression? No.
After several more iterations, we finally reach the end of the array, before restarting the same logic with the next token, and so on, until the whole grammar has been processed.
We have finished the rewrite of Prism. Fewer than 120 lines of code were necessary. You can find the complete source code here.
Bonus: The Reality of Programming Languages
Defining tokens using regular expressions is common. The program Lex, created in 1975 by Mike Lesk and Eric Schmidt, already worked this way. Sadly, regular expressions have limitations, especially as their support in some languages like JavaScript is not as complete as in reference languages like Perl.
An example: Java class names

A first regular expression would be: `[a-zA-Z0-9_]+`

Problem: This regular expression also matches variables and constants.

Solution: We can use Java conventions to match only identifiers starting with an uppercase letter, but this solution is probably too restrictive for a library like Prism. The solution implemented by Prism is different. A class name is expected at well-defined places (ex: after the keyword `class`). The idea is to look around the matches. We can do that with regular expressions. But…
Lookahead and Lookbehind support assertions about what must precede or follow the match. For example:

- `java(?!script)` searches for occurrences of `java` not followed by `script` (`java`, `javafx`, but not `javascript`). We talk about a Negative Lookahead.
- `java(?=script)` searches for occurrences of `java` followed by `script` (`javascript`, but not `java` or `javafx`). We talk about a Positive Lookahead.
- `(?<!java)script` searches for occurrences of `script` not preceded by `java` (`script`, `postscript`, but not `javascript`). We talk about a Negative Lookbehind.
- `(?<=java)script` searches for occurrences of `script` preceded by `java` (`javascript`, but not `postscript`). We talk about a Positive Lookbehind.
Caution: The regular expression `(?<=java)script` is different from `javascript`. The characters satisfying the lookarounds are not returned in the matching string (the result is `script` for the first regular expression and `javascript` for the second one).
The idea behind lookarounds is relatively easy to grasp, but their support varies between languages. For example, many languages, including Perl, restrict the characters allowed in a lookbehind (no metacharacters allowed, since the engine must determine how many characters it must go back). You can find more information here.
What about JavaScript? The answer is simple: JavaScript does not support lookbehinds. Therefore, Prism has to implement a workaround:
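In Prism, a grammar entry may be an object carrying a pattern and a lookbehind flag: the first capturing group of the pattern plays the role of the lookbehind. A sketch for our javalite class names (the regular expression is ours):

```javascript
// Stand-in objects so the sketch is self-contained.
const Prism = { languages: { javalite: {} } };

Prism.languages.javalite['class-name'] = {
  // The first group emulates the lookbehind:
  // a class name is expected after the keyword "class" (or "new").
  pattern: /(\b(?:class|new)\s+)[A-Za-z_]\w*/g,
  lookbehind: true
};
```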
With this new definition, we are looking for identifiers preceded by one of the defined keywords. In the implementation, if lookbehind is enabled, Prism removes the value of the first captured group to determine the actual value of the lexeme.
Here is the method `tokenize` with the changed lines highlighted:
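A sketch of how the changed lines might look, marked with // NEW (simplified from Prism's actual implementation):

```javascript
function Token(type, content) {
  this.type = type;
  this.content = content;
}

function tokenize(text, grammar) {
  const strarr = [text];

  for (const tokenName in grammar) {
    let pattern = grammar[tokenName];
    const lookbehind = !!pattern.lookbehind;  // NEW: read the flag
    pattern = pattern.pattern || pattern;     // NEW: unwrap { pattern, lookbehind }

    for (let i = 0; i < strarr.length; i++) {
      const str = strarr[i];
      if (str instanceof Token) continue;

      pattern.lastIndex = 0;
      const match = pattern.exec(str);
      if (!match) continue;

      // NEW: drop the first captured group from the matched lexeme.
      const lookbehindLength = (lookbehind && match[1]) ? match[1].length : 0;
      const from = match.index + lookbehindLength;      // NEW
      const matched = match[0].slice(lookbehindLength); // NEW

      const before = str.slice(0, from);
      const after = str.slice(from + matched.length);

      const replacement = [new Token(tokenName, matched)];
      if (before) replacement.unshift(before);
      if (after) replacement.push(after);
      strarr.splice(i, 1, ...replacement);
    }
  }
  return strarr;
}
```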
With this new feature, we can now test our code with more advanced examples:
Solution: This regular expression matches… regular expressions.
You may notice the use of the lookbehind workaround supported by Prism and the lookahead supported by all browsers.
Here is the complete rewrite:
- Prism provides hooks to extend the library with plugins. To understand these extension points and how plugins use them, you can check prism-core.js and the directory plugins.
- Prism allows one language to include other languages (ex: HTML files often contain JavaScript and CSS blocks). The implementation is elegant, requiring only a dozen lines of code. Check the file prism-core.js. Hint: Search for the properties `inside` and `rest`.
- Mastering regular expressions is a superpower for a developer.
- JavaScript does not support lookbehinds.
- Token != Lexeme.