Implementing Prism from Scratch

You are reading a translation of an old blog post published on my previous blog in French.

Many thanks to Lea Verou, et al., for Prism.js!

— Brendan Eich, creator of JavaScript

Many JavaScript libraries have supported syntactic coloration for a long time. A recent newcomer has become popular and is already used by major websites like Mozilla. This library is Prism and was created by Lea Verou, author of the library -prefix-free and code playbook Dabblet. Using Prism, our code is even more beautiful. What is the most surprising about Prims is the size of the codebase: only 400 lines of JavaScript (Google Prettify and SyntaxHighlighter count more than 2000 lines).

How does Prism achieve this tour de force? We’ll find out by rewriting Prism from scratch.

Prism is published under the MIT license. The code presented in this article has been simplified for obvious reasons and must not be used outside this learning context. This article is based on the latest version of Prism at the moment of the publication of this article.

Let’s Go!

The documentation to extend Prism gives us interesting details about the inner working of the library. Let’s start by outlining the global structure:

var Prism = (function() {

 var self = {

  languages : {}, // Every supported langague is added here

  highlightAll : function() {
   var elements = document.querySelectorAll('code[class*="language-"]');

   for (var i = 0, element; element = elements[i++];) {
    self.highlightElement(element);
   }
  },

  highlightElement : function(element) {
   var language = element.className.match(/\blanguage-(?!\*)(\w+)\b/i)[1];
   var grammar = self.languages[language];

   // Set the language on the parent, for styling
   var parent = element.parentNode;
   if (/pre/i.test(parent.nodeName)) {
    parent.className = parent.className + ' language-' + language;
   }

   var code = element.textContent;

   element.innerHTML = self.highlight(code, grammar, language);
  },

  highlight : function(text, grammar, language) {
   var tokens = self.tokenize(text, grammar);
   return self.Token.stringify(tokens, language);
  },

  tokenize : function(text, grammar) {
  // This is the most low-level function implementing the
  // lexical analyzer.
  }
 };

 return self;

})();

document.addEventListener('DOMContentLoaded', Prism.highlightAll);

¹: This is the entry point. We look for all source code present in the document.
²: We replace the previous content with the stylized one.
³: We tokenize the source code to colorize it using CSS classes.

The code is relatively simple to apprehend. We search the tags <code> having a CSS class starting with language-. We extract their content to split it into tokens as a compiler or interpreter would do. The main difference is that a compiler goes well beyond by creating an abstract syntaxic tree as an intermediate representation before generating code in the target language. Here, we don’t even try to ensure the code is valid. We are just looking for tokens to surround them with new tags having CSS classes on them.

The Lexical Analyzer

The function tokenize accepts two parameters:

text: the content of a tag <code>.
grammar: the definition of the programming language, often defined in different files.

Let’s take an example to illustrate the working of this function:

<pre><code class="language-javalite">
public class HelloWorld {

    public static void main(String[] args) {
        String test = "Hello World!";
        System.out.println(test);
    }
}
</code></pre>

We define a subset of the Java language named javalite:

Prism.languages.javalite = {
 'keyword': /\b(public|static|class|void)\b/g,
 'string': /("|')(\\?.)*?\1/g,
 'punctuation': /[{}[\];(),.:]/g
};

The variable text contains the code of our program HelloWorld and grammar contains the object Prism.languages.javalite.

Note that the definition of this javalite language is basic. Prism supports more options to address more exotic rules that will be discussed later in this article. Our definition consists only of three tokens with a regular expression to found them.

Token ou Lexeme?

The Dragon Book brings us the answer: “A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.”

To illustrate this difference using the previous example, our definition of language javalite uses three tokens (ex: keyword). The strings "public" or "static" are examples of lexemes of the same token keyword.

This better definition is not followed in the source code of Prism where lexemes and tokens are named using the same term token.

Here is the result returned by the function tokenize:

[
  "\n",
  { type: "keyword", content: "public"},
  " ",
  { type: "keyword", content: "class"},
  " HelloWorld ",
  { type: "punctuation", content: "{"},
  "\n\n",
  { type: "keyword", content: "public"},
  " ",
  { type: "keyword", content: "static"},
  " ",
  { type: "keyword", content: "void"},
  " main",
  { type: "punctuation", content: "("},
  "String",
  { type: "punctuation", content: "["},
  { type: "punctuation", content: "]"},
  " args",
  { type: "punctuation", content: ")"},
  " ",
  { type: "punctuation", content: "{"},
  "\n\t\tString test = ",
  { type: "string", content: "\"Hello World!\""},
  { type: "punctuation", content: ";"},
  "\n\t\tSystem",
  { type: "punctuation", content: "."},
  "out",
  { type: "punctuation", content: "."},
  "println",
  { type: "punctuation", content: "("},
  "test",
  { type: "punctuation", content: ")"},
  { type: "punctuation", content: ";"},
  "\n",
  { type: "punctuation", content: "}"},
  "\n",
  { type: "punctuation", content: "}"},
  "\n"
]

We retrieve our code sample divided into lexemes. For each lexeme having an associated token (string, punctuation, keyword), an object Token is created containing the text of the lexeme and the name of the token:

Token: function(type, content) {
 this.type = type;
 this.content = content;
}

Confused? Don’t worry. We will go back on the lexical analyzer in the last part.

The Syntaxic Coloration

Once the list of lexemes is identified, colorizing the code is trivial. It’s the job of the method Token.stringify:

Token.stringify = function(o, language, parent) {
 if (typeof o == 'string') { // Lexeme without defined token?
  return o;
 }

 if (Array.isArray(o)) { // List of lexemes => recurse
  return o.map(function(element) {
   return Token.stringify(element, language, o);
  }).join('');
 }

 var content = Token.stringify(o.content, language, parent);
 var classes = [ 'token', o.type ];

 return '<span class="' + classes.join(' ') + '">' + content + '</span>';
};

This recursive method is called initially with the complete list of lexemes. For every lexeme without a token found, the original value is preserved. For other lexemes, we decorate the value using a new tag <span> having the CSS classes token and the token name (keyword, punctuation, string, …).

Then, we have to define a few CSS declarations. (The tag <pre> is important to preserve the spacing and newlines).

pre {
    font-family: Consolas, Monaco, 'Andale Mono', monospace;*
    line-height: 1.5;
    color: black;
}
.token.punctuation {
    color: #999;
}
.token.string {
    color: #690;
}
.token.keyword {
    color: #07a;
}

Here is what our code looks like when these styles are applied:

The last missing piece from our puzzle is still the lexical analyzer.

The Lexical Analyzer (Again)

Let’s get started with a first version supporting the previous basic grammar:

tokenize : function(text, grammar) {
    var strarr = [ text ];

    tokenloop: for ( var token in grammar) {
        if (!grammar.hasOwnProperty(token) || !grammar[token]) {
            continue;
        }

        var pattern = grammar[token];

        for (var i = 0; i < strarr.length; i++) {

            var str = strarr[i];

            if (str instanceof self.Token) {
                continue;
            }

            var match = pattern.exec(str);

            if (match) {
                var from = match.index - 1,
                    match = match[0],
                    len = match.length,
                    to = from + len,
                    before = str.slice(0, from + 1),
                    after = str.slice(to + 1);

                var args = [ i, 1 ];

                if (before) {
                    args.push(before);
                }

                var wrapped = new self.Token(token, match);

                args.push(wrapped);

                if (after) {
                    args.push(after);
                }

                Array.prototype.splice.apply(strarr, args);
            }
        }
    }

    return strarr;
}

At first, the function may seem obscure but the logic is more simple as it may seem. For every token defined of the language grammar, we iterate over the input list containing initially a single string with the complete source code, but after several iterations, this string will be split into lexemes.

Let’s unwind the algorithm on our example, considering only the token keyword defined by the regular expression: /\b(public|static|class|void)\b/g:

strarray = ['public class HelloWorld { ... }'];
i = 0       +-----------------------------+

Does 'public class HelloWorld { ... }' matches the regular expression? Yes

We replace this element with three new elements:

The string before the match: the string is empty. We have nothing to add.
The found lexeme: public.
The string after the match: ' class HelloWorld { ... }'.

strarray = [Token, ' class HelloWorld { ... }'];
i = 1              +-----------------------+

Does ' class HelloWorld { ... }' matches the regular expression? Yes

Similarly, we replace the element with three new elements:

The string before the match: the space character.
The found lexeme: class.
The string after the match: ' HelloWorld { ... }'.

strarray = [Token, ' ', Token, ' HelloWorld { ... }'];
i = 2                   +---+

The element is already a processed token. We continue.

strarray = [Token, ' ', Token, ' HelloWorld { ... }'];
i = 3                          +-----------------+

Does ' HelloWorld { ... }' matches the regular expression? No

After several more iterations, we finally reach the end of the array, before restarting the same logic with the next token, and so on, until having processed the whole grammar.

Congratulations!

We have finished the rewrite of Prism. Less than 120 lines of code have been necessary. You can find the complete source code here.

Bonus: The Reality of Programming Languages

Defining tokens using regular expressions is common. The program LEX, created in 1975 by Mike Lesk et Eric Schmidt, worked already like that. Sadly, regular expressions have limitations, especially as their support in some languages like JavaScript is not as complete as reference languages like Perl.

An example: Java class names

A first regular expression would be: [a-z0-9_]+

Problem: This regular expression returns also variables and constants.
Solution: We can use Java conventions to only matches identifiers starting with an uppercase letter, but this solution is probably too restrictive for a library like Prism. The solution implemented by Prims is different. A class name is expected at well-defined places (ex: after the keyword class). The idea is to look around the matches. We can do that with regular expressions. But…

lookbehind + lookahead = lookaround

Lookahead and Lookbehind support assertions about what must precede or follow the match. For example:

java(?!script) searches for occurrences of java not followed by script (java, javafx but not javascript).
We talk about Negative Lookahead.
java(?=script) searches for occurrences of java followed by script (javascript but not java or javafx).
We talk about Positive Lookahead.
(?<!java)script searches for occurrences of script not preceded by java (script, postscript but not javascript).
We talk about Negative Lookbehind.
(?<=java)script searches for occurrences of script preceded by java (javascript but not postscript).
We talk about Positive Lookbehind.

Caution: The regular expression (?<=java)script is different from javascript. The characters satisfying the lookarounds are not returned in the matching string (the result is script for the first regular expression and javascript for the second one).

The idea behind lookarounds is relatively easy to grasp. But their support varies between languages. For example, many languages, including Perl, restrict the characters allowed in a lookbehind (no metacharacters allowed since Perl must determine the number of characters he must go back). You can find more information here.

What about JavaScript? The answer is simple: JavaScript does not support lookaheads. Therefore, Prism has to implement a workaround:

Prism.languages.javalite = {
  'class-name': {
    pattern: /(?:(class|interface|extends|implements|instanceof|new)\s+)[a-z0-9_]+/ig,
    lookbehind: true
  }
};

With this new definition, we are looking for identifiers preceded by one of the defined keywords. From the implementation, if lookbehind is enabled, Prism removes the value of the first captured group to define the actual value of the lexeme.

Here the method tokenize with the changed lines highlighted:

1
tokenize : function(text, grammar) {
2
 var strarr = [ text ];
3

4
 tokenloop: for ( var token in grammar) {
5
  if (!grammar.hasOwnProperty(token) || !grammar[token]) {
6
   continue;
7
  }
8

9
  var pattern = grammar[token],
10
      lookbehind = !!pattern.lookbehind,
11
      lookbehindLength = 0;
12

13
  pattern = pattern.pattern || pattern;
14

15
  for (var i = 0; i < strarr.length; i++) {
16
    // Don’t cache length as it changes during the loop
17

18
   var str = strarr[i];
19

20
   if (str instanceof self.Token) {
21
    continue;
22
   }
23

24
   var match = pattern.exec(str);
25

26
   if (match) {
27
    if (lookbehind) {
28
      lookbehindLength = match[1].length;
29
    }
30

31
    var from = match.index - 1 + lookbehindLength,
32
        match = match[0].slice(lookbehindLength),
33
        len = match.length,
34
        to = from + len,
35
        before = str.slice(0, from + 1),
36
        after = str.slice(to + 1);
37

38
    var args = [ i, 1 ];
39

40
    if (before) {
41
     args.push(before);
42
    }
43

44
    var wrapped = new self.Token(token, match);
45

46
    args.push(wrapped);
47

48
    if (after) {
49
     args.push(after);
50
    }
51

52
    Array.prototype.splice.apply(strarr, args);
53
   }
54
  }
55
 }
56

57
 return strarr;
58
}

With this new feature, we can now test our code with more advanced examples:

Quiz: Which token matches this regular expression?

/(^|[^/])\/(?!\/)(\[.+?]|\\.|[^/\r\n])+\/[gim]{0,3}(?=\s*($|[\r\n,.;})]))/g

Solution: This regular expression matches… regular expressions.

You may notice the use of the lookbehind workaround supported by Prism and the lookahead supported by all browsers.

Here is the complete rewrite:

var Prism = (function() {

 var self = {

  languages : {},

  highlightAll : function() {
   var elements = document.querySelectorAll('code[class*="language-"]');

   for (var i = 0, element; element = elements[i++];) {
    self.highlightElement(element);
   }
  },

  highlightElement : function(element) {
   var language = element.className.match(/\blanguage-(?!\*)(\w+)\b/i)[1];
   var grammar = self.languages[language];

   // Set language on the parent, for styling
   var parent = element.parentNode;
   if (/pre/i.test(parent.nodeName)) {
    parent.className = parent.className + ' language-' + language;
   }

   var code = element.textContent;

   element.innerHTML = self.highlight(code, grammar, language);
  },

  highlight : function(text, grammar, language) {
   var tokens = self.tokenize(text, grammar);
   return self.Token.stringify(tokens, language);
  },

  tokenize : function(text, grammar) {
   var strarr = [ text ];

   tokenloop: for ( var token in grammar) {
    if (!grammar.hasOwnProperty(token) || !grammar[token]) {
     continue;
    }

    var pattern = grammar[token],
        lookbehind = !!pattern.lookbehind,
        lookbehindLength = 0;

    pattern = pattern.pattern || pattern;

    for (var i = 0; i < strarr.length; i++) {
      // Don’t cache length as it changes during the loop

     var str = strarr[i];

     if (str instanceof self.Token) {
      continue;
     }

     var match = pattern.exec(str);

     if (match) {
      if (lookbehind) {
       lookbehindLength = match[1].length;
      }

      var from = match.index - 1 + lookbehindLength,
          match = match[0].slice(lookbehindLength),
          len = match.length,
          to = from + len,
          before = str.slice(0, from + 1),
          after = str.slice(to + 1);

      var args = [ i, 1 ];

      if (before) {
       args.push(before);
      }

      var wrapped = new self.Token(token, match);

      args.push(wrapped);

      if (after) {
       args.push(after);
      }

      Array.prototype.splice.apply(strarr, args);
     }
    }
   }

   return strarr;
  },

  Token: function(type, content) {
   this.type = type;
   this.content = content;

   self.Token.stringify = function(o, language, parent) {
    if (typeof o == 'string') {
     return o;
    }

    if (Array.isArray(o)) {
     return o.map(function(element) {
      return self.Token.stringify(element, language, o);
     }).join('');
    }

    var content = self.Token.stringify(o.content, language, parent);
    var classes = [ 'token', o.type ];

    return '<span class="' + classes.join(' ') + '">' + content + '</span>';
   };
  }
 };

 return self;

})();

document.addEventListener('DOMContentLoaded', Prism.highlightAll);


Prism.languages.java = {
 // C-like
 'comment': {
   pattern: /(^|[^\\])\/\*[\w\W]*?\*\//g,
   lookbehind: true
 },
 'string': /("|')(\\?.)*?\1/g,
 'class-name': {
  pattern: /((?:(?:class|interface|extends|implements|trait|instanceof|new)\s+)|(?:catch\s+\())[a-z0-9_\.\\]+/ig,
  lookbehind: true
 },
 'keyword': /\b(if|else|while|do|for|return|in|instanceof|function|new|try|throw|catch|finally|null|break|continue)\b/g,
 'boolean': /\b(true|false)\b/g,
 'function': {
  pattern: /[a-z0-9_]+\(/ig,
 },
 'number': /\b-?(0x[\dA-Fa-f]+|\d*\.?\d+([Ee]-?\d+)?)\b/g,
 'operator': /[-+]{1,2}|!|<=?|>=?|={1,3}|&{1,2}|\|?\||\?|\*|\/|\~|\^|\%/g,
 'ignore': /&(lt|gt|amp);/gi,
 'punctuation': /[{}[\];(),.:]/g,

 // Java Specific
 'keyword': /\b(abstract|continue|for|new|switch|assert|default|goto|package|synchronized|boolean|do|if|private|this|break|double|implements|protected|throw|byte|else|import|public|throws|case|enum|instanceof|return|transient|catch|extends|int|short|try|char|final|interface|static|void|class|finally|long|strictfp|volatile|const|float|native|super|while)\b/g,
 'number': /\b0b[01]+\b|\b0x[\da-f]*\.?[\da-fp\-]+\b|\b\d*\.?\d+[e]?[\d]*[df]\b|\W\d*\.?\d+\b/gi,
 'operator': {
  pattern: /(^|[^\.])(?:\+=|\+\+?|-=|--?|!=?|<{1,2}=?|>{1,3}=?|==?|&=|&&?|\|=|\|\|?|\?|\*=?|\/=?|%=?|\^=?|:|~)/gm,
  lookbehind: true
 }
};

Try for yourself!

Prism provides hooks to extend the library with plugins. To understand these extension points and how plugins use them, you can check prism-core.js and the directory plugins.
Prism supports one language to include other languages (ex: HTML files often contain JavaScript and CSS blocks). The implementation is elegant, requiring only a dozen of lines of code. Check the file prism-core.js. Hint: Search for properties inside and rest.

To Remember

Mastering regular expressions is a superpower for a developer.
JavaScript does not support lookbehinds.
Token != Lexeme.

About the author

Julien Sobczak works as a software developer for Scaleway, a French cloud provider. He is a passionate reader who likes to see the world differently to measure the extent of his ignorance. His main areas of interest are productivity (doing less and better), human potential, and everything that contributes in being a better person (including a better dad and a better developer).

Read Full Profile