Note: This is an old post originally from the documentation of the Sixten programming language, that I've touched up and fleshed out. After the time that it was written I've found out about Salsa, a Rust library with very similar goals to my Rock library, which is definitely worth checking out as well!
Compilers are no longer just black boxes that take a bunch of source files and produce assembly code. We expect them to:

- Be incremental: recompiling a project after a small change should only redo the work affected by that change.
- Power editor tooling such as language servers, providing features like go-to-definition, type information, and error messages as we type, even for code that doesn't yet compile.
This is what Anders Hejlsberg talks about in his video on modern compiler construction that some of you might have seen.
In this post I will cover how this is achieved in Sixten by building the compiler around a query system.
For those of you that don't know, Sixten is an experimental functional programming language created to give the programmer more control over memory layout and boxing than most other high-level languages do. The most recent development of Sixten is being done in the Sixty repository, and is completely query-based. Here's a little video giving a taste of what its language server can do, showing type-based completions:
A traditional compiler pipeline might look a bit like this:
+-----------+            +-----+                +--------+               +--------+
|           |            |     |                |        |               |        |
|source text|---parse--->| AST |---typecheck-+->|core AST|---generate--->|assembly|
|           |            |     |             ^  |        |               |        |
+-----------+            +-----+             |  +--------+               +--------+
                                             |
                                      read and write
                                           types
                                             |
                                             v
                                        +----------+
                                        |          |
                                        |type table|
                                        |          |
                                        +----------+
There are many variations, and often more steps and intermediate representations than in the illustration, but the idea stays the same:
We push source text down a pipeline and run a fixed set of transformations until we finally output assembly code or some other target language. Along the way we often need to read and update some state. For example, we might update a type table during type checking so we can later look up the type of entities that the code refers to.
Traditional compiler pipelines are probably quite familiar to many of us, but how query-based compilers should be architected might not be as well-known. Here I will describe one way to do it.
What does it take to get the type of a qualified name, such as "Data.List.map"? In a pipeline-based architecture we would just look it up in the type table. With queries, we have to think differently: instead of relying on having updated some piece of state, we compute the answer as if from scratch.
As a first iteration, we do it completely from scratch. It might look a little bit like this:
fetchType :: QualifiedName -> IO Type
fetchType (QualifiedName moduleName name) = do
  fileName <- moduleFileName moduleName
  sourceCode <- readFile fileName
  parsedModule <- parseModule sourceCode
  resolvedModule <- resolveNames parsedModule
  let definition = lookup name resolvedModule
  inferDefinitionType definition
We first find out what file the name comes from, which might be Data/List.vix for Data.List, then read the contents of the file, parse it, perhaps do name resolution to find out what the names in the code refer to given what is imported, and last we look up the name-resolved definition and type check it, returning its type.
All this just for getting the type of an identifier? It seems ridiculous, because looking up the type of a name is something we'll do loads of times during the type checking of a module. Luckily we're not done yet.
Let's first refactor the code into smaller functions:
fetchParsedModule :: ModuleName -> IO ParsedModule
fetchParsedModule moduleName = do
  fileName <- moduleFileName moduleName
  sourceCode <- readFile fileName
  parseModule sourceCode

fetchResolvedModule :: ModuleName -> IO ResolvedModule
fetchResolvedModule moduleName = do
  parsedModule <- fetchParsedModule moduleName
  resolveNames parsedModule

fetchType :: QualifiedName -> IO Type
fetchType (QualifiedName moduleName name) = do
  resolvedModule <- fetchResolvedModule moduleName
  let definition = lookup name resolvedModule
  inferDefinitionType definition
Note that each of the functions does everything from scratch on its own, i.e. each performs a (longer and longer) prefix of the work you'd do in a pipeline. I've found this to be a common pattern in my query-based compilers.
One way to make this efficient would be to add a memoisation layer around each function. That way, we do some expensive work the first time we invoke a function with a specific argument, but subsequent calls are cheap as they can return the cached result.
This is essentially what we'll do, except that instead of a separate cache per function, we'll have a central cache indexed by the query. This functionality is provided by Rock, a library that packages up some functionality for creating query-based compilers.
Rock is an experimental library heavily inspired by Shake and the Build systems à la carte paper. It essentially implements a build system framework, like make.
Build systems have a lot in common with modern compilers since we want them to be incremental, i.e. to take advantage of previous build results when building anew with few changes. But there's also a difference: Most build systems don't care about the types of their queries since they work at the level of files and file systems.
Build systems à la carte is closer to what we want. There the user writes a bunch of computations, tasks, choosing a suitable type for keys and a type for values. The tasks are formulated assuming they're run in an environment where there is a function fetch of type Key -> Task Value, where Task is a type for describing build-system rules, which can be used to fetch the value of a dependency with a specific key. In our example above, the key type might look like this:
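data Key
  = ParsedModuleKey ModuleName
  | ResolvedModuleKey ModuleName
  | TypeKey QualifiedName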
The build system has control over what code runs when we do a fetch, so by varying that it can do fine-grained dependency tracking, memoisation, and incremental updates.
Build systems à la carte is also about exploring what kind of build systems we get when we vary what Task is allowed to do, e.g. if it's a Monad or Applicative. In Rock, we're not exploring that, so our Task is a thin layer on top of IO.
A problem that pops up now, however, is that there's no satisfactory type for Value. We want fetch (ParsedModuleKey "Data.List") to return a ParsedModule, while fetch (TypeKey "Data.List.map") should return something of type Type.
Rock allows us to index the key type by the return type of the query. The Key type in our running example becomes the following GADT:
data Key a where
  ParsedModuleKey :: ModuleName -> Key ParsedModule
  ResolvedModuleKey :: ModuleName -> Key ResolvedModule
  TypeKey :: QualifiedName -> Key Type
The fetch function gets the type forall a. Key a -> Task a, so we get a ParsedModule when we run fetch (ParsedModuleKey "Data.List"), like we wanted, because the return type depends on the key we use.
Now that we know what fetch should look like, it's also worth revealing what the Task type looks like in Rock, more concretely. As mentioned, it's a thin layer around IO, providing a way to fetch keys (like Key above):
newtype Task key a = Task { unTask :: ReaderT (Fetch key) IO a }
newtype Fetch key = Fetch (forall a. key a -> IO a)
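Given these two definitions, a fetch function with the right type falls out by asking the environment for the callback and applying it to the key. Something like this sketch (not necessarily Rock's exact definition):

import Control.Monad.Trans.Reader (ReaderT (..))

-- Ask the environment for the fetch callback and apply it to the key.
fetch :: key a -> Task key a
fetch key = Task $ ReaderT $ \(Fetch doFetch) -> doFetch key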
The rules of our compiler, i.e. its "Makefile", then become the following function, reusing the functions from above:
rules :: Key a -> Task a
rules key = case key of
  ParsedModuleKey moduleName ->
    fetchParsedModule moduleName
  ResolvedModuleKey moduleName ->
    fetchResolvedModule moduleName
  TypeKey qualifiedName ->
    fetchType qualifiedName
The most basic way to run a Task in Rock is to directly call the rules function when a Task fetches a key. This results in an inefficient build system that recomputes every query from scratch.
But the Rock library lets us layer more functionality onto our rules function, and one thing that we can add is memoisation. If we do that, Rock caches the result of each fetched key by storing the key-value pairs of already performed fetches in a dependent hashmap. This way, we perform each query at most once during a single run of the compiler.
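To illustrate, a memoisation layer over the rules function might look something like the following sketch. It assumes a GCompare instance for Key so that Key can be used with Data.Dependent.Map, and it works in plain IO rather than Task for simplicity; this is not Rock's actual API:

{-# LANGUAGE RankNTypes #-}

import Data.Dependent.Map (DMap)
import qualified Data.Dependent.Map as DMap
import Data.Functor.Identity (Identity (..))
import Data.IORef

-- Wrap a rules function so that each key is computed at most once,
-- storing results in a central cache shared by all queries.
memoise
  :: IORef (DMap Key Identity)
  -> (forall a. Key a -> IO a)
  -> Key a
  -> IO a
memoise cacheRef rules key = do
  cache <- readIORef cacheRef
  case DMap.lookup key cache of
    Just (Identity value) ->
      -- Cache hit: reuse the previously computed result.
      pure value
    Nothing -> do
      -- Cache miss: run the rule and record the result.
      value <- rules key
      atomicModifyIORef' cacheRef $ \c ->
        (DMap.insert key (Identity value) c, ())
      pure value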
Another kind of functionality that can be layered onto the rules function is incremental updates. When it's used, Rock keeps track of what dependencies a task used when it was executed (much like Shake) in a table, i.e. what keys it fetched and what the values were. Using this information it's able to determine when it's safe to reuse the cache from a previous run of the compiler even though there might be changes in other parts of the dependency graph.
This fine-grained dependency tracking also allows reusing the cache when a dependency of a task changes in a way that has no effect. For example, whitespace changes might trigger a re-parse, but since the AST is the same, the cache can be reused in queries that depend on the parse result.
Verifying dependencies can be too slow for real-time tooling like language servers, because large parts of the dependency graph have to be traversed just to check that most of it is unchanged even for tiny changes.
For example, if we make changes to a source file with many large imports, we need to walk the dependency trees of all of the imports just to update the editor state for that single file. This is because dependency verification by itself needs to go all the way to the root queries for all the dependencies of a given query, which can often be a large proportion of the whole dependency tree.
To fix this, Rock can also be made to track reverse dependencies between queries. When e.g. a language server detects that a single file has changed, the reverse dependency tree is used to invalidate the cache just for the queries that depend on that file by walking the reverse dependencies starting from the changed file.
Since the imported modules don't depend on that file, they don't need to be re-checked, resulting in much snappier tooling!
Most modern languages need to have a strategy for tooling, and building compilers around query systems seems like an extremely promising approach to me.
With queries the compiler writer doesn't have to handle updates to and invalidation of a bunch of ad-hoc caches, which can be the result when adding incremental updates to a traditional compiler pipeline. In a query-based system it's all handled centrally once and for all, which means there's less of a chance it's wrong.
Queries are excellent for tooling because they allow us to ask for the value of any query at any time without worrying about order or temporal effects, just like a well-written Makefile. The system will compute or retrieve cached values for the query and its dependencies automatically in an incremental way.
Query-based compilers are also surprisingly easy to parallelise. Since we're allowed to make any query at any time, and they're memoised the first time they're run, we can fire off queries in parallel without having to think much. In Sixty, the default behaviour is for all input modules to be type checked in parallel.
Lastly, I hope that this post will have inspired you to use a query-based compiler architecture, and given you an idea of how it can be done.
I'm working on a reimplementation of Sixten, a dependently typed programming language that supports unboxed data. The reimplementation currently lives in a separate repository, and is called Sixty, though the intention is that it will eventually replace Sixten. The main reason for doing a reimplementation is to try out some implementation techniques to make the type checker faster, inspired by András Kovács' smalltt.
In this post I'm going to show some optimisations that I implemented recently. I will also show the workflow and profiling tools that I use to find what to optimise in Haskell programs such as Sixty.
What set me off was that I was curious to see how Sixty would handle programs with many modules. The problem was that no one had ever written any large programs in the language so far.
As a substitute, I added a command to generate nonsense programs of a given size. The programs that I used for the benchmarks in this post consist of just over 10 000 lines divided into 100 modules that all look like this:
module Module60 exposing (..)
import Module9
import Module24
import Module35
import Module16
import Module46
import Module37
import Module50
import Module47
import Module46
import Module3
f1 : Type
f1 = Module46.f10 -> Module46.f20
f2 : Type
f2 = Module50.f24 -> Module47.f13
[...]
f30 : Type
f30 = Module37.f4 -> Module24.f24
Each module is about 100 lines of code, of which a third or so are blank lines, and has thirty definitions that refer to definitions from other modules. The definitions are simple enough to be type checked very quickly, so the benchmark makes us focus mostly on other parts of the compiler.
I'd also like to write about the type checker itself, but will save that for other posts.
I use three main tools to try to identify bottlenecks and other things to improve:
bench is a replacement for the Unix time command that I use to get more reliable timings, especially useful for comparing the before and after time of some change.
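A typical timing run might look like this (assuming the sixty binary is on the PATH):

bench "sixty check"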
GHC's built-in profiling support, which gives us a detailed breakdown of where time is spent when running the program.
When using Stack, we can build with profiling by issuing something like:
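# assuming a Stack-based project
stack build --profile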
Then we can run the program with profiling enabled:
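# hypothetical invocation; the exact Stack incantation may differ
stack exec -- sixty check +RTS -p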
This produces a file sixty.prof that contains the profiling information.
I also really like to use ghc-prof-flamegraph to turn the profiling output into a flamegraph:
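# hypothetical invocation; see the tool's README for exact usage
ghc-prof-flamegraph sixty.prof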
ThreadScope is a visual tool for debugging the parallelism in a Haskell program. It also shows when the garbage collector runs, so it can be used when tuning garbage collector parameters.
We start out on this commit.
Running sixty check on the 100-module project on my machine gives us our baseline:
| | Time |
| --- | --- |
| Baseline | 1.30 s |
The flamegraph of the profiling output looks like this:
Two things stick out to me in the flamegraph:

- Parsing seems to take a large proportion of the total time.
- Operations on Data.Dependent.Map take about 15 % of the time, and a large part of that is calls to Query.gcompare when the map is doing key comparisons during lookups and insertions.

Here's what a run looks like in ThreadScope:
And here's a more zoomed in ThreadScope picture:
I note the following in the ThreadScope output:

- The compiler mostly runs on a single core; there's not much parallelism to speak of.
- Garbage collection runs very often and accounts for a large share, around 20 %, of the runtime.
As we saw in the ThreadScope output, garbage collection runs often and takes a large part of the total runtime of the type checker.
In this commit I most notably introduce the RTS option -A50m, which sets the default allocation area size used by the garbage collector to 50 MB, instead of the default 1 MB, which means that the GC can run less often, potentially at the cost of worse cache behaviour and memory use. The value 50m was found to be the sweet spot for performance on my machine by trying some different values.
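For reference, passing the flag at run time might look like this (assuming the binary is linked with -rtsopts; the flag can also be baked in at link time with -with-rtsopts=-A50m):

sixty check +RTS -A50m -RTS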
The result of this change is this:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
The ThreadScope output shows that the change very noticeably decreases the number of garbage collections:
Also note that the proportion of time used by the GC went from 20 % to 3 %, which seems good to me.
Rock is a library that's used to implement query-based compilation in Sixty. I made two improvements to it that made Sixty almost twice as fast at the task:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
The changes are:

- Using IORefs and atomic operations instead of MVars: Rock uses a cache, e.g. to keep track of what queries have already been executed. This cache is potentially accessed and updated from different threads. Before this change this state was stored in an MVar, but since it's only doing fairly simple updates, the atomic operations of IORef are sufficient.
- Being cleverer about when to parallelise: Rock can run the queries performed in an Applicative context in parallel. The change here is to only trigger parallel query execution if both sides of an application of the <*> operator do queries that are not already cached. Before this change even the cache lookup part of the queries was done in parallel, which is likely far too fine-grained to pay off.

We can clearly see in ThreadScope that the parallelisation has a seemingly good effect for part of the runtime, but not all of it:
Unfortunately I didn't update Sixty in between the two changes, so I don't really know how much each one contributes.
I wasn't quite happy with the automatic parallelism since it mostly resulted in sequential execution. To improve on that, I removed the automatic parallelism support from the Rock library, and started doing it manually instead.
Code-wise this change is quite small. In roughly reconstructed form (the names here are illustrative, not the exact commit), it's going from
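-- hypothetical names: fetch the list of input files
-- and check them one at a time
checkAll :: Task Query ()
checkAll = do
  inputFiles <- fetch Query.InputFiles
  forM_ inputFiles checkFile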
to
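-- same hypothetical names, but checking the files in parallel
-- (pooledForConcurrently_ is available e.g. in the unliftio package)
checkAll :: Task Query ()
checkAll = do
  inputFiles <- fetch Query.InputFiles
  pooledForConcurrently_ inputFiles checkFile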
where pooledForConcurrently_ is a variant of forM_ that runs in parallel, using pooling to keep the number of threads the same as the number of cores on the machine it's run on.
Here are the timings:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
| Manual parallelisation | 0.451 s | -26 % |
Being able to type check modules in parallel on a whim like this seems to be a great advantage of using a query-based architecture. The modules can be processed in any order, and any non-processed dependencies that are missing are processed and cached on an as-needed basis.
ThreadScope shows that the CPU core utilisation is improved, even though the timings aren't as much better as one might expect from the image:
The flamegraph is also interesting, because the proportion of time that goes to parsing has gone down to about 17 % without having made any changes to the parser, which can be seen in the top-right part of the image:
This might indicate that that part of the compiler parallelises well.
Here's an experiment that only helped a little. As we just saw, parsing still takes quite a large proportion of the total time spent, almost 17 %, so I wanted to make it faster.
The parser is written using parser combinators, and the "inner loop" of e.g. the term parser is a choice between a bunch of different alternatives, something like this:
term :: Parser Term
term =
  parenthesizedTerm      -- (t)
    <|> letExpression    -- let x = t in t
    <|> caseExpression   -- case t of branches
    <|> lambdaExpression -- \x. t
    <|> forallExpression -- forall x. t
    <|> var              -- x
These alternatives are tried in order, which means that to reach e.g. the forall case, the parser will try to parse the first token of each of the four preceding alternatives. But note that the first character of each alternative rules out all other cases, save for (sometimes) the var case.
So the idea here is to rewrite the parser like this:
term :: Parser Term
term = do
  firstChar <- lookAhead anyChar
  case firstChar of
    '(' ->
      parenthesizedTerm
    'l' ->
      letExpression
        <|> var
    'c' ->
      caseExpression
        <|> var
    '\\' ->
      lambdaExpression
    'f' ->
      forallExpression
        <|> var
    _ ->
      var
Now we just have to look at the first character to rule out the first four alternatives when parsing a forall.
Here are the results:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
| Manual parallelisation | 0.451 s | -26 % |
| Parser lookahead | 0.442 s | -2 % |
Not great, but it's something.
At this point, around 68 % of the time goes to operations on Data.Dependent.Map:
Note that this was 15 % when we started out, so it has become the bottleneck only because we've fixed several others.
Data.Dependent.Map implements a kind of dictionary data structure that allows the type of values to depend on the key, which is crucial for caching the result of queries, since each query may return a different type. Data.Dependent.Map is implemented as a clone of Data.Map from the containers package, adding this key-value dependency, so it's a binary tree that uses comparisons on the key type when doing insertions and lookups.
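To make this concrete, here's a sketch of the GCompare instance such a map needs for the Key type from the first post (illustrative; Sixty's actual Query type is larger, and Ord instances for ModuleName and QualifiedName are assumed). Note how every comparison bottoms out in comparing the names stored inside the keys:

{-# LANGUAGE GADTs #-}

import Data.GADT.Compare (GCompare (..), GEq (..), GOrdering (..))
import Data.Type.Equality ((:~:) (..))

instance GEq Key where
  geq a b = case gcompare a b of
    GEQ -> Just Refl
    _ -> Nothing

instance GCompare Key where
  gcompare (ParsedModuleKey m1) (ParsedModuleKey m2) =
    -- Compares the module name strings.
    case compare m1 m2 of
      LT -> GLT
      EQ -> GEQ
      GT -> GGT
  gcompare ParsedModuleKey {} _ = GLT
  gcompare _ ParsedModuleKey {} = GGT
  gcompare (ResolvedModuleKey m1) (ResolvedModuleKey m2) =
    case compare m1 m2 of
      LT -> GLT
      EQ -> GEQ
      GT -> GGT
  gcompare ResolvedModuleKey {} _ = GLT
  gcompare _ ResolvedModuleKey {} = GGT
  gcompare (TypeKey n1) (TypeKey n2) =
    -- Compares the qualified name strings.
    case compare n1 n2 of
      LT -> GLT
      EQ -> GEQ
      GT -> GGT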
In the flamegraph above we can also see that around 21 % of the time goes to comparing the Query type. The reason for this slowness is likely that queries often contain strings, because most are things like "get the type of [name]". Strings are slow to compare because you need to traverse at least part of the string for each comparison.
It would be a better idea to use a hash map, because then the string usually only has to be traversed once, to compute the hash, but the problem is that there is no dependent hash map library in Haskell. Until now, that is. I implemented a dependent version of the standard Data.HashMap type from the unordered-containers package as a thin wrapper around it. The results are as follows:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
| Manual parallelisation | 0.451 s | -26 % |
| Parser lookahead | 0.442 s | -2 % |
| Dependent hashmap | 0.257 s | -42 % |
Having a look at the flamegraph after this change, we can see that HashMap operations take about 20 % of the total run time, which is a lot better than 68 % (though there's still room for improvement). We can also see that the main bottleneck is now the parser:
ReaderT-based Rock library

Here's one that I did by ear, since it wasn't obvious from the profiling.
I mentioned that the Rock library used to support automatic parallelisation, but that I switched to doing it manually. A remnant from that is that the Task type in Rock is implemented in a needlessly inefficient way. Task is a monad that allows fetching queries, and is used throughout most of the Sixty compiler. Before this change, Task was implemented roughly as follows:
newtype Task query a = Task { unTask :: IO (Result query a) }

data Result query a where
  Done :: a -> Result query a
  Fetch :: query a -> (a -> Task query b) -> Result query b
So to make a Task that fetches a query q, you need to create an IO action that returns a Fetch q pure. When doing automatic parallelisation, this allows introspecting whether a Task wants to do a fetch, such that independent fetches can be identified and run in parallel.
But actually, since we no longer support automatic parallelisation, this type might as well be implemented like this:
newtype Task query a = Task { unTask :: ReaderT (Fetch query) IO a }
newtype Fetch query = Fetch (forall a. query a -> IO a)
The ReaderT-based implementation turns out to be a bit faster:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
| Manual parallelisation | 0.451 s | -26 % |
| Parser lookahead | 0.442 s | -2 % |
| Dependent hashmap | 0.257 s | -42 % |
| ReaderT in Rock | 0.245 s | -5 % |
Let's have a look at the flamegraph at this point in time:
The parser now takes almost 30 % of the total run time. The parser is written using parser combinators that work directly on characters, so it's also doing tokenisation, or lexing, on the fly.
I've been wondering about the performance impact of this practice, since it's quite common in the Haskell world. So the change I made here is to write a faster lexer that's separate from the parser, and then make the parser work on the list of tokens that the lexer spits out.
This turns out to be a great idea:
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
| Manual parallelisation | 0.451 s | -26 % |
| Parser lookahead | 0.442 s | -2 % |
| Dependent hashmap | 0.257 s | -42 % |
| ReaderT in Rock | 0.245 s | -5 % |
| Separate lexer | 0.154 s | -37 % |
The "inner loop" of the parser that I tried optimising in the "Parser lookahead" step has now become a case expression on the next token, visible here. Essentially, it's gone from matching on characters to this:
term :: Parser Term
term = do
  token <- getNextToken
  case token of
    Lexer.LeftParen -> parenthesizedTerm
    Lexer.Let -> letExpression
    Lexer.Identifier ident -> pure (Var ident)
    Lexer.Case -> caseExpression
    Lexer.Lambda -> lambdaExpression
    Lexer.Forall -> forallExpression
    Lexer.Number int -> pure (Lit int)
The flamegraph at this point contains mostly things I don't really know what to do with, but there's one thing left, and that's hashing of queries, which now takes just short of 18 % of the total runtime:
The change I made here is to write some Hashable instances by hand instead of deriving them, and to add a couple of inlining pragmas. A hand-written instance might look like the following sketch (the Query constructors are hypothetical, and Hashable instances for the payload types are assumed):
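import Data.Hashable (Hashable (..))

-- Hash a small constructor tag together with the payload, instead of
-- going through a generically derived instance.
instance Hashable (Query a) where
  hashWithSalt salt query = case query of
    ParsedModuleQuery moduleName -> hashWithSalt salt (0 :: Int, moduleName)
    ResolvedModuleQuery moduleName -> hashWithSalt salt (1 :: Int, moduleName)
    TypeQuery qualifiedName -> hashWithSalt salt (2 :: Int, qualifiedName)
  {-# INLINE hashWithSalt #-}

This gives a 5 % speedup: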
| | Time | Delta |
| --- | --- | --- |
| Baseline | 1.30 s | |
| RTS flags | 1.08 s | -17 % |
| Rock | 0.613 s | -43 % |
| Manual parallelisation | 0.451 s | -26 % |
| Parser lookahead | 0.442 s | -2 % |
| Dependent hashmap | 0.257 s | -42 % |
| ReaderT in Rock | 0.245 s | -5 % |
| Separate lexer | 0.154 s | -37 % |
| Faster hashing | 0.146 s | -5 % |
The new flamegraph shows that query hashing is now down to around 11 % of the time.
I was able to make the Sixty compiler nine times faster for this benchmark by using the excellent profiling tools that we have for Haskell. There's no reason to be optimising in the dark here.
As a reminder, here's what the compiler looked like in ThreadScope to start with:
Here's where we're at now:
It looks faster and it is faster.