(c) 2016 Cezary Kaliszyk, François Chollet, Christian Szegedy

### Download the dataset: HolStep.tgz (114MB).

The dataset is released under the BSD 3 Clause license (see LICENSE file packaged with the data).

### Read the paper (accepted for publication at ICLR 2017)

### Download the baseline code and the code used to generate the dataset

## Abstract

Large computer-understandable proofs consist of millions of intermediate
logical steps. The vast majority of such steps originate from manually
selected and manually guided heuristics applied to intermediate goals.
So far, machine learning has generally not been used to filter or
generate these steps. In this paper, we introduce a new dataset based on
higher-order logic (HOL) proofs, for the purpose of developing new
machine learning-based theorem-proving strategies. We make this dataset
publicly available under the BSD license. We propose various machine
learning tasks that can be performed on this dataset, and discuss their
significance for theorem proving. We also benchmark a set of baseline
deep learning models suited for the tasks (including convolutional
neural networks and recurrent neural networks). The results of our
baseline models show the promise of applying deep learning to HOL
theorem proving.

## Dataset Contents

The dataset contains 9999 training files and 1411 testing files.

Each file in the dataset contains a number of training examples,
together with the conjecture and the dependencies on which those
examples may be conditioned.

There are 2013046 training examples and 196030 testing examples in total.

Each file consists of a conjecture block, a number of dependency
blocks, and a number of training/testing example blocks.

The conjecture block starts with an 'N' and consists of 3 lines:

```
N <name of conjecture>
C <text representation of the conjecture>
T <tokenization of the conjecture>
```

Each dependency block starts with a 'D' and consists of 3 lines:

```
D <name of dependency>
A <text representation of the dependency>
T <tokenization of the dependency>
```

Each training/testing example starts with the symbol '+' or '-',
where '+' means the step was useful in the final proof and '-' means
it was not, and consists of 2 lines:

```
+ <text representation of the intermediate step>
T <tokenization of the intermediate step>
```
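The block structure above can be read with a simple line-oriented parser. The following is a minimal sketch, not an official tool; the function name and the dictionary field names (`name`, `text`, `tokens`, `useful`) are assumptions chosen for illustration, and only the line markers (`N`, `C`, `T`, `D`, `A`, `+`, `-`) come from the format description:

```python
# Minimal sketch of a parser for one HolStep file, based on the block
# structure described above. Field names are illustrative assumptions.

def parse_holstep_file(path):
    """Return (conjecture, dependencies, examples) parsed from one file."""
    conjecture = None
    dependencies = []
    examples = []
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    i = 0
    while i < len(lines):
        if not lines[i]:
            i += 1  # skip blank lines
            continue
        marker, rest = lines[i][0], lines[i][2:]
        if marker == "N":
            # Conjecture block: N <name>, C <text>, T <tokenization>
            conjecture = {"name": rest,
                          "text": lines[i + 1][2:],
                          "tokens": lines[i + 2][2:].split()}
            i += 3
        elif marker == "D":
            # Dependency block: D <name>, A <text>, T <tokenization>
            dependencies.append({"name": rest,
                                 "text": lines[i + 1][2:],
                                 "tokens": lines[i + 2][2:].split()})
            i += 3
        elif marker in "+-":
            # Example block: +/- <step text>, T <tokenization>
            examples.append({"useful": marker == "+",
                             "text": rest,
                             "tokens": lines[i + 1][2:].split()})
            i += 2
        else:
            i += 1  # defensively skip unrecognized lines
    return conjecture, dependencies, examples
```

Splitting the `T` line on whitespace assumes tokens are space-separated, which matches the tokenized representations in the dataset; a model pipeline would typically map these tokens to integer ids afterwards.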