Greetings to Boston from the Tyrolean mountains. My name is Michael Färber, and I am happy to talk to you about "Terms for Efficient Proof Checking and Parsing".

A while ago, I worked on exporting proofs from the ITP Isabelle. The generated proofs turned out to be quite large, so I spent a large amount of time waiting for a proof checker to certify them. Trying to improve the performance of proof checking, I decided to research parallelism. However, it turned out to be quite difficult to achieve both high single- and multi-threaded performance while preserving a small kernel.

To better understand the problem, let us have a look at how the proof checker works. It reads a sequence of theorems from a file and processes them, where a theorem is a statement along with its proof. When I started my work, the proof checker only supported sequential processing: that is, for each theorem parsed from a file, we check its proof completely before continuing with the next theorem. I quickly found that parsing consumed a considerable amount of time, so I tried to parallelise parsing first. For this, we create a distinct parsing thread, which sends parsed theorems via a channel to the main thread, which processes them as before. In theory, this can reduce the overall runtime by the time required for parsing. Next, I looked into parallel checking. Here, we check proofs in separate threads. That means that we do not wait for a proof to be checked before processing the next theorem, so multiple proofs can be parsed and checked simultaneously. However, the main difficulty turned out to be how to efficiently check proofs in different threads.

At CPP'22, I already presented a proof checker called Kontroli, written in Rust. It reimplements large parts of the Dedukti proof checker; however, unlike Dedukti, Kontroli supports parallel parsing and checking as shown before. I was able to show that parallel checking with Kontroli improved performance over Dedukti. Still, that work had a few shortcomings. In particular, Kontroli used different kernels for single- and multi-threaded checking to achieve high performance, and switching from the single-threaded to the multi-threaded kernel incurred a significant performance overhead. Furthermore, parallel checking was far from reaching the theoretically optimal performance.

In this work, I remedy both shortcomings. In particular, I will show heterogeneous terms to improve checking performance. In my paper, I also show abstract terms to improve parsing performance, but due to time constraints, I will not talk about them here. I will show that the combination of these two techniques reduces overall runtime by a factor of up to 3.6.

I will now talk about the traditional way of modelling terms, which I call "homogeneous terms". Terms are *the* central data structure in a proof checker like ours, so the way we model terms has a large impact on performance. In our case, a term is either a constant, a variable, an application, or a lambda- or Pi-abstraction. We can model such terms in a functional programming language quite straightforwardly by a sum type; here, constants are represented by strings, variables by integers that stand for de Bruijn indices, and applications, lambda-abstractions, and dependent products contain their respective subterms. In a lower-level programming language like Rust, however, we have to use pointers to obtain inductive types.
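To make this concrete, here is a minimal sketch of what such a term type could look like in Rust, with the recursive positions placed behind `Box` pointers; the names and exact shape are my own illustration and not necessarily the types used in Kontroli.

```rust
// A straightforward term type for the lambda-Pi calculus.
// Recursive occurrences sit behind `Box`, because Rust needs
// the size of every type to be known at compile time.
enum Term {
    /// Constant, identified by its name.
    Const(String),
    /// Variable, given as a de Bruijn index.
    Var(usize),
    /// Application of one term to another.
    Appl(Box<Term>, Box<Term>),
    /// Lambda-abstraction: type of the bound variable and body.
    Abst(Box<Term>, Box<Term>),
    /// Dependent product (Pi-abstraction): domain and codomain.
    Prod(Box<Term>, Box<Term>),
}

fn main() {
    // The identity function on a constant type `A`: \ (x : A) => x
    let id = Term::Abst(
        Box::new(Term::Const("A".to_string())),
        Box::new(Term::Var(0)),
    );
    // Applying it to a constant `a`.
    let _app = Term::Appl(Box::new(id), Box::new(Term::Const("a".to_string())));
}
```

Note that cloning such a term means recursively copying every boxed subterm, a point we will come back to in a moment.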
Because there are multiple pointer types, we have more degrees of freedom to model inductive types; on the other hand, we have to choose wisely to obtain good performance. We can distinguish pointer types in Rust based on three properties: thread-safety, i.e. whether we can use a pointer across threads; sharing, i.e. whether two pointers can point to the same object in memory; and speed, i.e. how fast we can create and destroy pointers. Here, I show three pointer types, where each satisfies two of the three properties. For example, `Box` is thread-safe and fast, but it does not support sharing. `Rc` stands for "reference-counted", and `Arc` stands for "atomically reference-counted". The ampersand in the middle is a reference type that combines all three properties, but it requires us programmers to prove in the code that the object it points to remains in memory at least as long as the reference itself. I will show you occurrences of all these pointer types in the remainder of this talk.

Now let us have a first go at modelling terms in Rust, along the lines of the sketch above. The term type on this slide models constants and variables similarly to what we saw before, but for applications and abstractions, we now use `Box` pointers to wrap inductive occurrences of the term type. This is shown by the diagram on the right, which means that when we have a term, we have to pass through a `Box` in order to arrive at its subterms. There are two problems with this term type: First, using `Box` as a pointer type means that duplicating a term requires linear time. This is not acceptable for proof checking, because checking frequently duplicates terms. On the other hand, it is fine for other applications like proof parsing, because parsing does not duplicate any terms. The second problem is that abstractions use two `Box` pointers, although a single pointer to a pair of terms would be sufficient.

So let us now address both problems with a new term type. Here, I factor out the inductive variants of the type and model them with a new type that I call `Comb` for "combinator", which is polymorphic over the type of terms that it contains. This leaves us with the new term type `LTerm`: It models constants and variables similarly to before, but it now contains a variant called `LComb` that stores an atomically reference-counted pointer to an `LTerm` combinator. That means that when we have an `LTerm`, we have to pass through an `Arc` to obtain a combinator, from which we directly obtain its subterms without passing through another pointer type. I used this term type in last year's CPP paper. However, it has one principal problem, namely that creating many terms that contain `Arc` is quite slow. Furthermore, when we are checking proofs sequentially, `Arc` only costs us overhead, but does not bring any advantages. This is why we have a second kernel for sequential checking that differs from the first one only by using thread-unsafe `Rc` pointers instead of `Arc` for better performance. Still, this is quite cumbersome.
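As a rough sketch, assuming variant names and a field layout that follow the description above (Kontroli's actual definitions differ in detail), this factoring could look as follows:

```rust
use std::sync::Arc;

/// Inductive term variants ("combinators"), polymorphic over the
/// type of the terms they contain.
enum Comb<T> {
    /// Application of one term to another.
    Appl(T, T),
    /// Lambda-abstraction: type of the bound variable and body.
    Abst(T, T),
    /// Dependent product: domain and codomain.
    Prod(T, T),
}

/// Homogeneous terms: every combinator sits behind an atomically
/// reference-counted pointer.
#[derive(Clone)]
enum LTerm {
    Const(String),
    /// De Bruijn index.
    Var(usize),
    /// Cloning this variant only increments a reference count,
    /// so duplicating a term takes constant time.
    LComb(Arc<Comb<LTerm>>),
}

// The sequential kernel of last year's Kontroli differs only by
// using `Rc<Comb<...>>` instead of `Arc<Comb<...>>` here.
```

Cloning an `LTerm` that wraps an `Arc` is cheap, but creating and dropping many such terms requires atomic reference-count updates, which is where the slowdown mentioned above comes from.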
So let me now address these problems with a new approach, namely heterogeneous terms. The basic observation behind heterogeneous terms is that we have two complementary structures in our proof checker. The first one is the global context Gamma, which stores background knowledge, such as the theorems that we have processed so far. The second one is the local context Delta, which stores knowledge that is only relevant to the current proof checking task, such as the current proof. The terms contained in these two structures differ with respect to several properties:

* The terms in Gamma mostly consist of the types and definitions of constants, whereas the terms in Delta contain proofs as well as temporary terms created during proof checking.
* The terms in Gamma live until the proof checker terminates, whereas the terms in Delta live only until a proof has been checked.
* The terms in Gamma are relatively few and their number is bounded by the input, whereas the terms in Delta can be many and their number cannot be bounded a priori.
* The terms in Gamma can be accessed from multiple threads, in particular when we are checking multiple proofs at the same time, whereas the terms in Delta are only accessed from a single proof checking thread.

These differences suggest that we can use separate types for the terms in Gamma and Delta. To name these terms, we can remark that Gamma and Delta resemble long- and short-term memory, an observation due to Gilles Dowek, to whom I would like to express my gratitude. This is why I call the terms in Gamma long terms and the terms in Delta short terms.

Let us now look at how we can implement heterogeneous terms. For long terms, we can directly use the `LTerm` type that I showed before. For short terms, let us go for now with the type `STerm` here. It closely resembles `LTerm`, but it uses the thread-unsafe `Rc` type where `LTerm` uses the thread-safe `Arc` type. Using these two types to model terms, we have the problem that converting an `LTerm` to an `STerm`, which happens very frequently during proof checking, takes linear time. That is because we cannot easily reference a long term from a short term.

Let us address this problem by adapting the implementation of short terms. This version of `STerm` adds another variant, namely `LComb`, which contains a reference to a long term combinator. This representation has a number of desirable properties: First, converting a long term to a short term can now be done in constant time, because constants and variables can be converted trivially, and long term combinators can be stored in a short term using a reference. Furthermore, short terms that reference long terms can be created and deleted very efficiently, because references can be created and deleted without any reference counting. The only potential disadvantage is that we cannot forget long terms while there are still short terms that reference them. However, this does not matter in Kontroli, because the set of long terms grows monotonically.
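To illustrate how this could look in code, here is a minimal sketch with illustrative names and simplified types (Kontroli's actual definitions differ in detail); the lifetime `'a` ties every short term to the long terms it borrows from:

```rust
use std::rc::Rc;
use std::sync::Arc;

/// Combinators and long terms as in the previous sketch.
enum Comb<T> {
    Appl(T, T),
    Abst(T, T),
    Prod(T, T),
}

enum LTerm {
    Const(String),
    Var(usize),
    LComb(Arc<Comb<LTerm>>),
}

/// Short terms: used within a single proof checking thread only.
enum STerm<'a> {
    Const(String),
    Var(usize),
    /// Combinator of short terms, behind a cheap, thread-unsafe `Rc`.
    SComb(Rc<Comb<STerm<'a>>>),
    /// Reference to a long term combinator: creating and dropping it
    /// involves no reference counting at all.
    LComb(&'a Comb<LTerm>),
}

/// Convert a long term to a short term in constant time.
fn to_short(t: &LTerm) -> STerm {
    match t {
        // Constants and variables are converted trivially.
        LTerm::Const(c) => STerm::Const(c.clone()),
        LTerm::Var(v) => STerm::Var(*v),
        // No deep copy: just borrow the long term combinator.
        LTerm::LComb(c) => STerm::LComb(c.as_ref()),
    }
}
```

In such a design, the borrow checker guarantees statically that no short term outlives the long terms it references, which matches the situation in Kontroli, where the set of long terms only ever grows.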
Let us now have a look at the performance of homogeneous and heterogeneous terms. For this, I evaluated Dedukti and Kontroli using a dataset of proofs exported from Isabelle/HOL, which is 2.5 GB in size and contains 1.7 million proofs. First, I show you the results for Dedukti: With sequential proof checking, Dedukti takes 534 seconds to check the whole dataset, of which 254 seconds are spent on parsing alone. This demonstrates that parsing can indeed account for a significant share of the total runtime. Next, I show you the results for different configurations of Kontroli, where blue bars stand for last year's version, using homogeneous terms for checking, and red bars stand for this year's version, using heterogeneous terms for checking. The first configuration performs sequential processing, similar to Dedukti. We can see that while last year's Kontroli was roughly as fast as Dedukti, this year's Kontroli is more than twice as fast.

Next, I show you the results for parallel checking, where I vary the number of checking threads between 1 and 4. We can see that using one thread incurs a significant overhead for last year's Kontroli. This stems from the fact that here, we use the multi-threaded kernel, which uses the slower `Arc` pointers, whereas the single-threaded kernel uses the faster `Rc` pointers. In contrast, the performance of this year's Kontroli remains the same, because both single- and multi-threaded checking use the same kernel. Increasing the number of checking threads decreases the runtime for both last year's and this year's Kontroli.

Now, let us see what happens when we additionally use parallel parsing. For last year's Kontroli, this actually increases the runtime, whereas for this year's Kontroli, the runtime decreases. This is because this year's Kontroli uses abstract terms for parsing. The next configuration shows how long Kontroli takes without actually checking proofs; it marks the theoretical optimum for parallel checking. We can see that with combined parallel parsing and checking, this year's Kontroli gets very close to this theoretical optimum, whereas last year's Kontroli is quite far from it. Finally, the last configuration shows the runtime for Kontroli just parsing its input. Here, we see that parsing performance has improved as well: this year's Kontroli is more than 5 times as fast as Dedukti.

This brings me to the conclusion of this talk. I showed you homogeneous terms, which are a good match for parsing, because parsing does not involve cloning any terms. For checking, however, heterogeneous terms that distinguish long and short terms are a better choice. In particular, they allow short terms to reference long terms efficiently, because such references do not involve any reference counting. Furthermore, they enable both sequential and parallel proof checking with a single kernel, without incurring any overhead. Thank you for your attention.