Highlighting on this blog

Posted on

Table of contents

1 Introduction

Someone on fedi asked how you could syntax highlight a REPL-session on your website, and since I have it on my website I decided to write this article to explain how I did it to everyone!

It is written with Megaparsec and Skylighting (Kate).

The idea is as follows:

  1. parse a prompt;
  2. parse input;
  3. optionally parse output;
  4. repeat.

The parser for the prompt decides which inputs it accepts, and the input decides what kind of output it expects.

In this article we will look at the following three scenarios:

  • bash with basic commands;
  • bash with the cat-command, showing the content of the file highlighted (here XML);
  • GHCi.

2 How it looks

Below you can see a small REPL-session.

lira@pc:~$ echo hoihoi!
hoihoi!
lira@pc:~$ ghci
ghci Prelude> 5+5
10
ghci Prelude> :m Data.Text
ghci Prelude Data.Text> 5+5
10
ghci Prelude Data.Text> "test"
"test"
ghci Prelude Data.Text> putStrLn "test"
test
ghci Prelude Data.Text> :q
Leaving GHCi.
lira@pc:~$ ghci
ghci Prelude> :q
Leaving GHCi.
lira@pc:~$ cat test.xml
<?xml version="1.0" ?>
<foo xmlns="https://[::1]/foo" xmlns:b="https://[::1]/bar">
	<b:bar yes="no" />
</foo>

Small REPL-session with bash and GHCi

3 The code

3.1 Getting ready

Lets just write a standalone script that takes a REPL as input and returns the highlighted HTML.

First we need some imports and cabal config. We use mtl for the MonadReader, megaparsec for the parsing, blaze for the HTML generation and Skylighting for the highlighting.

#!/usr/bin/env cabal
{- cabal:
	build-depends:
		base, text,
		containers, mtl,
		megaparsec
		blaze-html, blaze-markup,
		skylighting, skylighting-core,
	ghc-options: -Wno-tabs
-}
{-# LANGUAGE OverloadedStrings, LambdaCase, FlexibleContexts #-}
module Main where

import Data.Void (Void)
import Data.Text (Text, pack, singleton, stripEnd, cons, intercalate)
import Data.Functor (($>))
import Data.Functor.Identity (runIdentity)
import Data.List (intersperse)
import Data.Map (union)

import Control.Applicative (liftA2, (<|>), asum)
import Control.Monad.Reader (MonadReader, ask, runReaderT)

import qualified Text.Megaparsec as P
import qualified Text.Megaparsec.Char as P
import Text.Blaze.Html (Html, toHtml)
import Text.Blaze.Internal (MarkupM(..))
import qualified Text.Blaze.XHtml5 as H
import qualified Text.Blaze.XHtml5.Attributes as A
import Text.Blaze.Html.Renderer.String (renderHtml)

import Skylighting (defaultFormatOpts, tokenize, TokenizerConfig(..), lookupSyntax, defaultSyntaxMap, formatHtmlInline, SyntaxMap, SourceLine(..))
import Skylighting.Loader (loadSyntaxesFromDir)
Shebang, cabal config and imports

We also need some input:

input :: Text
input = intercalate "\n"
	[ "lira@pc:~$ echo hoihoi!"
	, "hoihoi!"
	, "lira@pc:~$ ghci"
	, "ghci Prelude> 5+5"
	, "10"
	, "ghci Prelude> :m Data.Text"
	, "ghci Prelude Data.Text> 5+5"
	, "10"
	, "ghci Prelude Data.Text> \"test\""
	, "\"test\""
	, "ghci Prelude Data.Text> putStrLn \"test\""
	, "test"
	, "ghci Prelude Data.Text> :q"
	, "Leaving GHCi."
	, "lira@pc:~$ ghci"
	, "ghci Prelude> :q"
	, "Leaving GHCi."
	, "lira@pc:~$ cat test.xml"
	, "<?xml version=\"1.0\" ?>"
	, "<foo xmlns=\"https://[::1]/foo\" xmlns:b=\"https://[::1]/bar\">"
	, "\t<b:bar yes=\"no\" />"
	, "</foo>"
	]
The input REPL

3.2 The entry-point

Our main looks like this: we load the syntax-files, run the highlighter, render it as HTML and finally print it to the console.

main :: IO ()
main = do
	syntaxes <- loadSyntaxes
	let
		output = runHighlighter syntaxes input
		renderedOutput = renderHtml output
	putStrLn renderedOutput
main-function

You might only need the defaultSyntaxMap configuration, but for my blog I need more syntax files. One example is for my article about PDF, where the PDF-syntax does not come pre-installed with Skylighting.

loadSyntaxes :: IO SyntaxMap
loadSyntaxes =  loadSyntaxesFromDir "./syntax/" >>= \case
	Left _ -> pure defaultSyntaxMap
	Right b -> pure $ b `union` defaultSyntaxMap
loadSyntaxes definition

In order to store those syntax files and not have to pass them around as parameters, we will use a MonadReader. We set that up in runHighlighter:

runHighlighter :: SyntaxMap -> Text -> Html
runHighlighter syntaxes code =
	runIdentity (runReaderT (highlightPromptOutput code) syntaxes)
runHighlighter definition

3.3 Using Skylighting

Let's define a function highlight that takes care of all the skylighting configuration. It only needs input and language, and will output our code highlighted:

highlight :: (MonadReader SyntaxMap m) => Text -> Text -> m Html
highlight code language = do
	syntaxes <- ask
	pure $ case lookupSyntax language syntaxes of
		Nothing -> toHtml code
		Just syntax -> case tokenize (tokenizerConfig syntaxes) syntax code of
			Left _ -> toHtml code
			Right sourceLines -> removeCruft sourceLines
	where
		tokenizerConfig :: SyntaxMap -> TokenizerConfig
		tokenizerConfig sm = TokenizerConfig
			{ traceOutput = False
			, syntaxMap = sm
			}
		removeCruft :: [SourceLine] -> Html
		removeCruft sourceLines =
			case formatHtmlInline defaultFormatOpts sourceLines of
				(AddAttribute _ _ _ (Parent _ _ _ subElements)) -> subElements
				o -> o
highlight definition

The removeCruft is only there to get rid of extra HTML that skylighting wraps around the output. If you want to keep that you can leave that part out.

3.4 Parsing with megaparsec

We now have the highlighter we can use for code blocks containing a single language. We want a REPL-session, which has multiple languages. We need to work with parsers as the input has to be static and line-based, instead of how most syntax highlighters work. The parser also has to figure out which language to use.

All parsers need a ParsecT-type. Let's create a type for it so we don't have to repeat ourselves to much:

type Parser m a = P.ParsecT Void Text m a
The Parser-type

In Parser m a, m is the monad and a is the type being parsed.

For the parsing we also use the MonadReader, but because it has to be wrapped with the ParsecT we have to call runReaderT again.

highlightPromptOutput :: (MonadReader SyntaxMap m) => Text -> m Html
highlightPromptOutput t = do
	syntaxes <- ask
	runReaderT (P.runParserT prompts "" t) syntaxes >>= case
		(Left _) -> pure $ H.text t
		(Right html) -> pure $ foldr1 (*>) html
highlightPromptOutput definition

If parsing is successful we will get multiple lines of code, which we foldr1 nicely together. If it is not successful, we just return the plain text. We cannot do anything else to recover.

For us prompts will be bash and GHCi.

prompt :: (MonadReader SyntaxMap m) => Parser m Html
prompt = asum $ P.try <$> [ bashPrompt, ghciPrompt ]

prompts :: (MonadReader SyntaxMap m) => Parser m [Html]
prompts = P.many prompt
prompts definition

Before we get into the parsers themselves, lets also create a function that will let us wrap spans with certain classes around the generated HTML. That gives us some more control over how to highlight prompts.

textWithClass :: Text -> Text -> Html
textWithClass c t = H.span H.! A.class_ (H.textValue c) $ H.text t
textWithClass definition

3.5 Parsing input and output

For parsing input and output, we need the following parsers:

closingNewline :: Parser m Text
closingNewline = singleton <$> P.single '\n' <|> P.eof $> ""

inputP :: Parser m Text
inputP = do
	i <- P.some (P.chunk "\\\n" <|> (singleton <$> P.anySingleBut '\n'))
	end <- closingNewline
	pure $ mconcat i <> end

outputP :: Parser m Text
outputP = do
	P.notFollowedBy promptStartP
	o <- pack <$> P.many (P.try (P.single '\n' <* P.notFollowedBy promptStartP) <|> P.anySingleBut '\n')
	end <- closingNewline
	pure $ o <> end
	where
		promptStartP :: Parser m Html
		promptStartP = asum [ ghciP, userNameP ]
Parsers for input and output

You might wonder what ghciP or userNameP are. Those are our starting points for the prompts, the first part of the GHCi and bash prompt parsers.

3.6 bash prompt

The default prompt in Ubuntu can be summarized with the following regex:

^(\w+)@(\w+):([a-zA-Z/~\.]+)(\$|#) ([^]*?)(?<!\\)\n[^\n]*
Regex of bash prompt

While I currently use Arch, I still really like the default prompt Ubuntu came with so I will stick with it!

The regex translated to a parser looks like this:

bashPrompt :: (MonadReader SyntaxMap m) => Parser m Html
bashPrompt = do
	userName <- userNameP
	host <- hostP
	path <- pathP
	privilege <- privilegeP <* P.single ' '
	(input, output) <- catFileP <|> basicIOP
	bashCode <- highlight input "bash"
	pure $ do
		userName
		textWithClass "prompt-at" "@"
		host
		textWithClass "prompt-colon" ":"
		path
		privilege
		H.text " "
		H.kbd bashCode
		H.text "\n"
		output
userNameP :: Parser m Html
userNameP = textWithClass "prompt-user" . pack <$> P.many P.alphaNumChar <* P.single '@'
hostP :: Parser m Html
hostP = textWithClass "prompt-host" . pack <$> P.many P.alphaNumChar <* P.single ':'
pathP :: Parser m Html
pathP = textWithClass "prompt-path" . pack <$> P.many (P.alphaNumChar <|> P.oneOf ("/~." :: [Char]))
privilegeP :: Parser m Html
privilegeP = textWithClass "prompt-privilege" . singleton <$> (P.single '$' <|> P.single '#')
bash-prompt

Now we only need the parsers for catFileP and basicIOP

The parser basicIOP is very straightforward.

basicIOP :: Parser m (Text, Html)
basicIOP = liftA2 (,) inputP $ P.optional outputP >>= \case
	Just output -> pure . H.span $ H.text output
	Nothing -> pure $ H.text ""
basicIOP definition

The parser catFileP is a bit more work. Here we need to use highlight to get the correct highlighting for the shown file.

catFile :: (MonadReader SyntaxMap m) => Parser m (Text, Html)
catFile = do
	syntaxes <- ask
	i <- mconcat <$> sequence
		[ P.chunk "cat "
		, pack <$> P.many P.alphaNumChar
		, singleton <$> P.single '.'
		]
	extension <- pack <$> P.many P.alphaNumChar <* P.single '\n'
	o <- outputP
	let
		l = extensionToHL extension
	
	o' <- highlight o l
	let
		o'' = H.span H.! A.class_ (H.textValue l) $ o'
	pure (i <> extension <> "\n", o'' *> H.text "\n")
	where
		extensionToHL :: Text -> Text
		extensionToHL "hs" = "haskell"
		extensionToHL e = e
catFileP definition

3.7 GHCi prompt

The prompt of GHCi has the following pattern:

ghci( [A-Z][a-zA-Z.]+)*(>|\|) [^\n]*\n[^\n]*\n
Regex of GHCi prompt

Translated into a parser, it looks like this:

ghciPrompt :: (MonadReader SyntaxMap m) => Parser m Html
ghciPrompt = do
	ghci <- ghciP <* P.single ' '
	modules <- ghciModules
	end <- textWithClass "end" . singleton <$> (P.single '>' <|> P.single '|') <* P.single ' '
	io <- ghciSettingInput <|> ghciHaskellIO
	pure $ ghci *> H.text " " *> modules *> end *> H.text " " *> io
ghciP :: Parser m Html
ghciP = textWithClass "ghci" <$> P.try (P.chunk "ghci")
ghciModule :: Parser m Html
ghciModule = textWithClass "module" . pack <$> P.some (P.alphaNumChar <|> P.single '.')
ghciModules :: Parser m Html
ghciModules = do
	ms <- P.many (ghciModule <* P.optional (P.single ' '))
	pure $ foldr1 (*>) (intersperse (H.text " ") ms)
GHCi prompt

Again, we branch of with the I/O-part where input can either be commands to change the environment, or the REPL we most often use.

ghciSettingInput :: Parser m Html
ghciSettingInput = do
	P.single ':'
	command <- asum ((liftA2 commandToHtml `uncurry`) <$>
		[ (P.chunk "m", P.single ' ' *> (Just <$> ghciModule))
		, (P.chunk "module", P.single ' ' *> (Just <$> ghciModule))
		, (P.chunk "q", pure Nothing)
		, (P.chunk "quit", pure Nothing)
		, (P.chunk "{", pure Nothing)
		, (P.chunk "}", pure Nothing)
		])
	closingNewline
	output <- P.optional outputP
	pure $ do
		command
		H.text "\n"
		case output of
			Just output -> H.text output
			Nothing -> H.text ""
	where
		commandToHtml :: Text -> Maybe Html -> Html
		commandToHtml c Nothing = textWithClass "command" (':' `cons` c)
		commandToHtml c (Just a) = commandToHtml c Nothing *> H.text " " *> a

ghciHaskellIO :: (MonadReader SyntaxMap m) => Parser m Html
ghciHaskellIO = do
	input <- inputP
	output <- outputP
	iH <- highlight input "haskell"
	oH <- highlight output "haskell"
	pure $ do
		H.kbd iH
		H.text "\n"
		oH
		H.text "\n"
ghciSettingInput and ghciHaskellIO parsers

4 Conclusion

And there you have it, highlighting for your website!

5 Sources