An Evaluation Algorithm for Datalog with Equality
Abstract
We describe an evaluation algorithm for relational Horn logic (RHL).
RHL extends Datalog with quantification over sorts, existential quantification in conclusions and, crucially, the ability to infer equalities.
These capabilities allow RHL evaluation to subsume and expand applications of Datalog and congruence closure algorithms.
We explain how aspects of a fast congruence closure algorithm can be incorporated into Datalog evaluation to obtain an efficient RHL evaluation algorithm.
We then sketch how Steensgaard's points-to analysis and type inference can be implemented using RHL evaluation.
RHL and the evaluation algorithm described here are the foundation of the Eqlog Datalog engine.
1 Introduction
Recent work has identified relational Horn logic (RHL) and partial Horn logic (PHL) as semantically well-behaved extensions of Datalog [7].
This paper describes how the Datalog evaluation algorithm can be generalized to RHL evaluation.
An RHL theory contains declarations of the following data:
- A set of sorts.
- A set of relations.
- Arities for all relations.
- A set of sequents (or rules, or axioms), each of the form φ ⇒ ψ, where φ and ψ are conjunctions of atoms.
Instead of RHL sequents as above, Datalog engines typically accept rules of the form A :- B₁, …, Bₙ. The head A corresponds to the conclusion of an RHL sequent and consists of a single atom. The body B₁, …, Bₙ corresponds to the premise of an RHL sequent, can contain multiple atoms and is interpreted as a conjunction.
The structure of RHL sequents is thus more general at first sight, because the conclusion of an RHL sequent is allowed to be a conjunction of atoms. However, a single sequent with n conclusion atoms is equivalent to n sequents, each with a single conclusion atom. Thus, no generality is gained solely from allowing general conjunctions as conclusions.
Where RHL generalizes Datalog is in what kind of atoms it allows, and how variables are handled.
In Datalog, each atom is of the form R(v₁, …, vₙ), where R is a relation symbol and the vᵢ are variables whose sorts match the arity of R. In addition to such relation atoms, RHL also recognizes the following types of atoms:
- An equality atom u ≡ v, where u and v have the same sort. We reserve the symbol ≡ for RHL syntax, whereas = is used for meta-theoretical equality.
- A sort quantification x↓, where x is a variable with known sort s. If the sort of x is not determined by other atoms in the sequent, we also use the syntax x : s as synonymous for x↓ and the metatheoretical assertion that x has sort s.
If an equality atom u ≡ v occurs in the premise of a sequent, then matches of the premise are only valid if u and v are interpreted as the same constant. Thus, equality atoms in a premise can be eliminated by replacing all occurrences of one of the two variables in the sequent by the other variable.
The semantics of an equality atom u ≡ v in the conclusion of a sequent are non-trivial, however: Whenever the premise of such a sequent matches such that u is interpreted as a constant a and v is interpreted as a constant b, then we expect the evaluation engine to identify a and b in all contexts henceforth. For example, the premise of the transitivity axiom Le(u, v) ∧ Le(v, w) ⇒ Le(u, w) should match tuples Le(a, b₁) and Le(b₂, c) if an equality b₁ ≡ b₂ has been inferred previously.
Partial functions can be encoded in RHL using relations representing the graphs of partial functions. Thus one identifies a partial function f : s₁ × … × sₙ ⇀ s with a relation F ⊆ s₁ × … × sₙ × s, where the first n components of each entry represent an element in the domain of the function and the last component represents the value of the function. The functionality axiom

F(v₁, …, vₙ, u) ∧ F(v₁, …, vₙ, w) ⇒ u ≡ w    (1)

enforces that the relation F does indeed correspond to a well-defined partial function.
Sort quantifications x↓ in premises allow matching elements of a given sort s that do not appear in any relation.
In standard Datalog, all variables in the head of a sequent must also appear in the body.
This requirement is removed in RHL.
Variables that only appear in the conclusion are implicitly existentially quantified.
If the premise of a sequent matches, then the evaluation engine must extend the match to the conclusion by finding an interpretation of variables that occur only in the conclusion such that the conclusion holds.
If no such extension exists, then the evaluation engine must create new identifiers to interpret variables that only occur in the conclusion, and enforce that the atoms of the conclusion hold for this interpretation.
We expect the evaluation engine to output a list of identifiers of each sort, including those identifiers that were created during evaluation.
An RHL sequent in which all conclusion variables also occur in the premise is called
surjective.
If an RHL theory contains non-surjective sequents, then evaluation need not terminate.
The Souffle Datalog engine implements a similar mechanism in the form of its choice construction [
2].
The presence of non-surjective sequents can not only lead to non-termination but also to non-deterministic results, in the sense that the result depends on the order in which sequents are matched. This is not the case for strong RHL theories, in which the interpretation of conclusion variables is uniquely determined once all sequents are satisfied.
For example, the RHL theory given by the functionality sequent (3) and the sequent

v₁↓ ∧ … ∧ vₙ↓ ⇒ F(v₁, …, vₙ, w)    (2)

is strong, since if the functionality axiom is satisfied, then the interpretation of the conclusion variable w in sequent (2) is uniquely determined.
Unfortunately, it is undecidable whether a given RHL theory is strong.
A further problem with RHL is that functions must be encoded via their graphs.
This leads to excessive verbosity when formulating axioms involving complex expressions built up from function symbols.
Partial Horn logic (PHL) [
5] is a syntactic layer on top of RHL that rectifies these shortcomings.
In partial Horn logic, relations must be explicitly declared as predicates or partial functions.
Predicates correspond directly to RHL relations, whereas functions are lowered into a relation corresponding to the graph of the function and the implicit functionality axiom
(3).
In positions where RHL accepts variables, PHL also allows composite terms, which are recursively defined from variables and applications of partial function symbols to terms whose sorts match the function signature. When lowering PHL to RHL, a composite term t is recursively lowered into a result variable representing t and additional RHL atoms. These additional atoms are inserted into the premise or conclusion of the sequent, depending on where t appears.
To lower a composite term f(t₁, …, tₙ), we may assume that tᵢ = vᵢ for variables vᵢ by recursively lowering the arguments t₁, …, tₙ first. We now choose a fresh variable u representing f(t₁, …, tₙ) and add the RHL atom F(v₁, …, vₙ, u). Since lowering a PHL formula reduces nested expressions into a flat sequence of atoms, the process of lowering PHL to RHL is also called flattening.
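As an illustration of this recursion, flattening could be sketched as follows. The Term type, the (relation, arguments) representation of graph atoms, and the vᵢ naming scheme for fresh variables are assumptions made for the sake of the example, not Eqlog's actual data structures.

```rust
// A PHL term is either a variable or a partial function applied to terms.
enum Term {
    Var(String),
    App(String, Vec<Term>),
}

// Lowers `term` to a variable; graph atoms (relation name, arguments with
// the result variable last) are appended to `atoms`.
fn flatten(term: &Term, atoms: &mut Vec<(String, Vec<String>)>, fresh: &mut usize) -> String {
    match term {
        Term::Var(v) => v.clone(),
        Term::App(f, args) => {
            // Recursively lower the argument terms first.
            let mut arg_vars: Vec<String> =
                args.iter().map(|t| flatten(t, atoms, fresh)).collect();
            // Choose a fresh result variable and add the graph atom.
            let result = format!("v{}", *fresh);
            *fresh += 1;
            arg_vars.push(result.clone());
            atoms.push((f.clone(), arg_vars));
            result
        }
    }
}

fn main() {
    // Flatten the composite term g(f(x)).
    let term = Term::App(
        "g".to_string(),
        vec![Term::App("f".to_string(), vec![Term::Var("x".to_string())])],
    );
    let mut atoms = Vec::new();
    let mut fresh = 0;
    let result = flatten(&term, &mut atoms, &mut fresh);
    // g(f(x)) flattens to the atoms F(x, v0) and G(v0, v1), result v1.
    assert_eq!(atoms[0], ("f".to_string(), vec!["x".to_string(), "v0".to_string()]));
    assert_eq!(atoms[1], ("g".to_string(), vec!["v0".to_string(), "v1".to_string()]));
    assert_eq!(result, "v1");
}
```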
PHL sequents are
epic if all variables in the conclusion are already introduced in the premise.
Note that lowering epic PHL sequents can result in non-surjective RHL sequents, because lowering composite terms can introduce fresh variables.
Nevertheless, the lowered RHL theory resulting from an epic PHL theory (i.e. a PHL theory containing epic sequents only) is strong.
Conversely, every strong RHL theory is equivalent to an epic PHL theory; see [
7, Section 4.3] for details.
Thus, epic PHL has the same descriptive strength as strong RHL.
On the other hand, checking whether a PHL sequent is epic is trivial, whereas checking whether an RHL theory is strong is undecidable.
This makes PHL more suitable than RHL as a human-facing input language for an evaluation engine.
The
Eqlog engine, whose underlying algorithm is described in this paper, accepts epic PHL as input language.
Eqlog lowers a user-provided epic PHL theory to RHL, which is then transpiled to a Rust module.
This is similar to the Souffle Datalog engine, which transpiles Datalog to C++.
In contrast to Souffle, Eqlog is meant to be used as part of a larger project and not as a standalone tool.
Similarly to the
egg equality saturation library [
11], Eqlog supports online use-cases in which one alternates inserting new ground facts into the model and closing the model under PHL sequents.
Refer to the Eqlog project homepage [
1] for details.
Independently of the work on Eqlog presented in this article, members of the Egg community have created a very similar tool that combines e-graphs with Datalog, which will be reported on in an upcoming article.
Outline.
In Section
2, we describe a basic algorithm to evaluate RHL, and a technique to detect termination in many circumstances.
Then, in Section
3, we discuss optimizations of the RHL evaluation algorithm.
Finally, in Section
4, we sketch applications of PHL and RHL evaluation to the implementation of programming languages.
Section
5 concludes.
Acknowledgements.
Jakob Botsch Nielsen contributed significantly to previous versions of the implementation of the Eqlog tool and its application to an experimental type checker for a dependently typed proof assistant.
My own work on Eqlog and the algorithm presented in this paper commenced during my time as PhD student at Aarhus University, where I was advised by Bas Spitters and supported by the Air Force Office of Scientific Research project “Homotopy Type Theory and Probabilistic Computation”, grant number 12595060.
2 RHL Evaluation
In this section, we describe a basic algorithm to evaluate RHL theories.
The input to the evaluation algorithm is a list of RHL sequents and a
relational structure representing ground facts.
A relational structure is given by sets of numerical identifiers for each sort, representing the elements of the sort, and sets of tuples for each relation.
From Section
2.2 onward, we assume that relational structures contain also union-find data structures for each sort, representing semantic equality of sort elements.
If RHL evaluation terminates, then we require that the output is a relational structure that is
weakly free with respect to the list of RHL sequents over the input relational structure [
7].
Intuitively, this means that the output must satisfy all sequents in the RHL theory, and that it must be obtained from the input relational structure only from matching sequent premises and adjoining data corresponding to conclusions.
Weak freeness does not uniquely characterize a relational structure.
In general, the output relational structure depends on the order of matching premises of RHL sequents.
However, if the RHL theory is strong, then the output relational structure is
(strongly) free over the input relational structure, which determines it uniquely up to unique isomorphism (i.e. renaming of identifiers).
Relevant classes of strong RHL theories are theories containing surjective sequents only, and theories that are obtained from lowering epic PHL theories [
7].
In Section
2.1, we discuss the naive algorithm for Datalog evaluation and amend it with support for non-surjective sequents and sort quantification, but not equality.
Then, in Section
2.2, we consider a well-known simple congruence closure algorithm, which we shall understand as a special-purpose algorithm for evaluation of functionality RHL sequents.
The conclusion of functionality sequents is an equality, which our naive Datalog evaluation algorithm cannot process.
Union-find data structures and
normalization are aspects of this congruence closure algorithm that deal with equalities in particular.
We incorporate these into our naive Datalog algorithm to obtain a naive RHL evaluation algorithm.
In Section
2.3, we discuss an example where our RHL evaluation algorithm does not terminate for a non-surjective RHL theory with finite free models.
Based on the example, we show how the evaluation algorithm can be modified to terminate for this particular example and also a wide range of other RHL theories.
2.1 Naive Datalog evaluation
The naive Datalog algorithm is given by repeating
premise matching and
conclusion application phases until a fixed point is reached.
The high-level structure of the algorithm can be expressed in Rust-like pseudo-code as follows:
fn datalog(structure, sequents) {
    loop {
        // 1. Match premises.
        let matches = [];
        for sequent in sequents {
            matches.push(find_matches(structure, sequent.premise));
        }
        // 2. Apply conclusions.
        let has_changed = false;
        for (sequent, sequent_matches) in sequents.zip(matches) {
            for match in sequent_matches {
                if apply_conclusion(structure, sequent.conclusion, match) {
                    has_changed = true;
                }
            }
        }
        // Terminate if applying conclusions had no effect.
        if !has_changed {
            break;
        }
    }
    return structure;
}
find_matches
is a subprocedure that returns a list of matches of the given formula in a relational structure.
Each match is given by a mapping from the set of variables that occur in the formula to elements in the relational structure.
A naive implementation of this function enumerates matches using a nested loop join.
For example, matches of the formula Le(u, v) ∧ Ge(w, v) ∧ Le(w, x) can be enumerated as follows:
for (u, v) in structure.rels[Le] {
    for (w, v1) in structure.rels[Ge] {
        if v1 != v { continue; }
        for (w1, x) in structure.rels[Le] {
            if w1 != w { continue; }
            matches.push([u, v, w, x]);
        }
    }
}
Each relational atom translates into a nested loop over the corresponding relation in the relational structure, and each sort quantification translates into a loop over the corresponding list of elements.
apply_conclusion
is a subprocedure that inserts data into a relational structure according to a provided conclusion and a substitution of variables for elements in the relational structure.
It returns a boolean value indicating whether the operation had an effect, i.e. whether at least some of the concluded data was not already present in the relational structure.
For surjective sequents without equalities, where every variable in the conclusion is already bound by a match of the premise, we substitute the variables in each relation atom and insert the corresponding tuple into the relational structure.
For non-surjective sequents, we first check if the provided substitution of premise variables can be extended to interpretations of the conclusion variables such that the conclusion holds.
This can be accomplished using a version of the find_matches
function that takes a list of already fixed interpretations of some of the variables in the formula.
If no such extension exists, then we adjoin fresh elements to the relational structure to interpret the unbound conclusion variables and proceed as in the surjective case.
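In the same Rust-like pseudo-code as above, apply_conclusion might be sketched as follows; find_extension stands for the hypothetical variant of find_matches that takes fixed interpretations of some variables:

```
fn apply_conclusion(structure, conclusion, match) {
    // Check whether the premise match extends to the conclusion.
    if find_extension(structure, conclusion, match).exists() {
        return false; // the conclusion already holds
    }
    // Adjoin fresh elements to interpret unbound conclusion variables.
    for var in conclusion.vars() {
        if !match.contains(var) {
            match[var] = structure.adjoin_element(var.sort());
        }
    }
    // Insert the substituted conclusion atoms.
    let has_changed = false;
    for atom in conclusion.atoms() {
        if structure.insert(atom.substitute(match)) {
            has_changed = true;
        }
    }
    return has_changed;
}
```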
2.2 Congruence closure and naive RHL evaluation
RHL equality atoms can be reduced to Datalog by introducing binary equality relations on each sort representing inferred equality.
However, this
setoid reduction [
7, Section 3.4] typically leads to inefficient Datalog programs.
Semantically, an inferred equality often reduces the size of the relational structure, since equalities can collapse two previously distinct tuples in a relation into a single tuple.
Instead, the setoid reduction leads to significant duplication due to congruence axioms, which assert that all relations must be closed in each argument under inferred equality.
Our goal in this section is to rectify this deficiency:
Every inferred equality should only shrink the relational structure.
Observe that RHL can be used to solve, in particular, the congruence closure problem: decide which equalities among a list t₁, …, tₖ of expressions follow from a list of equalities among subexpressions. This problem can be encoded in RHL with a theory given by an (n + 1)-ary relation symbol F, representing the graph of each n-ary function symbol f that occurs in the tᵢ, and the functionality axiom

F(v₁, …, vₙ, u) ∧ F(v₁, …, vₙ, w) ⇒ u ≡ w    (3)

One then inserts data corresponding to the tᵢ into a relational structure, imposes equalities among subexpressions, and closes the structure under the functionality axioms.
We may thus understand congruence closure algorithms as special-purpose evaluation algorithms for functionality RHL sequents, and try to generalize existing congruence closure algorithms to general RHL evaluation.
Our inspiration here is the congruence closure algorithm described in [
3].
Consider the following version of their naive algorithm 2.1, which they attribute to [
8].
The version presented here is specialized to a single binary function.
The input of the algorithm is a list of triples representing the graph of the function.
fn congruence_closure(graph) {
    let uf = UnionFind::new();
    loop {
        // 1. Match premises.
        let eqs = [];
        for (x0, x1, x2) in graph {
            for (y0, y1, y2) in graph {
                if x0 == y0 && x1 == y1 {
                    eqs.push((x2, y2));
                }
            }
        }
        // 2. Apply equalities.
        let has_changed = false;
        for (lhs, rhs) in eqs {
            let lhs = uf.find(lhs);
            let rhs = uf.find(rhs);
            if lhs != rhs {
                uf.union(lhs, rhs);
                has_changed = true;
            }
        }
        // Terminate if nothing has changed.
        if !has_changed {
            break;
        }
        // 3. Normalize.
        let graph0 = [];
        for (x0, x1, x2) in graph {
            graph0.push((uf.find(x0), uf.find(x1), uf.find(x2)));
        }
        graph = graph0;
    }
    return uf;
}
Similarly to the Datalog evaluation algorithm of Section 2.1, this congruence closure algorithm repeats a number of steps until it reaches a fixed point. Step 1 corresponds to the find_matches function for the premise of the functionality axiom (3) of a binary function. Step 2 applies the conclusion x2 ≡ y2 for each match that was found in step 1.
The algorithm uses a union-find data structure to represent equality.
A union-find data structure associates a canonical representative to each equivalence class.
Equivalence classes are equal if and only if they have the same canonical representative.
Union-find data structures support fast find
and union
operations in near-constant runtime.
The find
operation computes the canonical representative in the equivalence class of a given element.
The union
operation merges the equivalence classes of two canonical representatives.
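A union-find data structure can be sketched in a few lines of Rust. This is a minimal version with path compression but without the union-by-rank heuristic that production implementations typically add:

```rust
// Minimal union-find over elements 0..n, as used to represent
// inferred equality of sort elements.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        // Initially, every element is its own canonical representative.
        UnionFind { parent: (0..n).collect() }
    }

    // Returns the canonical representative of x's equivalence class,
    // compressing paths along the way.
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    // Merges the equivalence classes of two elements.
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

fn main() {
    let mut uf = UnionFind::new(4);
    uf.union(0, 1);
    uf.union(2, 3);
    assert_eq!(uf.find(0), uf.find(1));
    assert_ne!(uf.find(0), uf.find(2));
}
```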
Step 3, which replaces all elements in entries of the
graph
relation by canonical representatives, does not have a counterpart in Datalog evaluation.
Because of the use of the union-find data structure, only comparisons among canonical representatives reflect inferred equality.
Note that, instead of the normalization step, we could also consult the union-find data structure in step 1 during premise matching when comparing elements.
However, a separate normalization step makes the use of a number of optimizations possible, which we discuss in Section
3.
By incorporating aspects of the congruence closure algorithm that deal with equalities into the naive Datalog evaluation algorithm of Section
2.1, we now obtain our
naive RHL evaluation algorithm:
- In addition to sets of elements and relations, relational structures now also contain union-find data structures for each sort, representing semantic equality. We maintain the invariant that the relations in the relational structure contain canonical representatives only before each iteration of the evaluation algorithm.
- apply_conclusion handles equalities u ≡ v by merging the equivalence classes of the interpretations of u and v with a call to union.
- Before the end of the loop body, we insert a normalization step, which replaces each element in a tuple in any relation with its canonical representative by calling find. Since relational structures store relations as sets without duplication, normalization can potentially reduce the number of tuples in relations.
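The normalization step for a single binary relation could be sketched as follows, assuming some find function mapping elements to their canonical representatives; because relations are stored as sets, distinct tuples can collapse into one:

```rust
use std::collections::HashSet;

// Normalization sketch: replace every element of every tuple by its
// canonical representative. `find` stands in for the union-find lookup.
fn normalize(
    rel: &HashSet<(usize, usize)>,
    find: impl Fn(usize) -> usize,
) -> HashSet<(usize, usize)> {
    rel.iter().map(|&(a, b)| (find(a), find(b))).collect()
}

fn main() {
    let rel: HashSet<(usize, usize)> = [(0, 1), (0, 2)].into_iter().collect();
    // Suppose 1 and 2 were merged, with 1 the canonical representative.
    let normalized = normalize(&rel, |x| if x == 2 { 1 } else { x });
    // The tuples (0, 1) and (0, 2) collapse into the single tuple (0, 1).
    assert_eq!(normalized.len(), 1);
    assert!(normalized.contains(&(0, 1)));
}
```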
The Souffle Datalog engine provides an efficient implementation of equivalence relations using union-find data structures [
4], but it does not implement normalization.
2.3 Detecting termination
If all sequents in the RHL theory to be evaluated are surjective, then the algorithm we have described in Section
2.2 is guaranteed to terminate.
For non-surjective sequents, however, the (weakly) free model over a finite structure need not be finite.
In these situations, RHL evaluation can thus not terminate and must instead be aborted after a timeout is exceeded or a desired property has been inferred.
Nevertheless, there are RHL theories for which free models over finite relational structures are again finite despite the presence of non-surjective sequents.
But even in these situations, the RHL evaluation algorithm we have discussed so far need not terminate.
Consider, for example, the following PHL theory that axiomatizes pairs of maps f : A → B and g : B → A such that g(f(x)) = x for all x : A:
1. x↓ ⇒ f(x)↓ (f is total)
2. y↓ ⇒ g(y)↓ (g is total)
3. f(x)↓ ⇒ g(f(x)) ≡ x
Axiom 3 is lowered to the RHL axiom F(x, y) ⇒ G(y, x) (where F and G denote the graphs of f and g), which is surjective. Axioms 1 and 2, however, are non-surjective.
Nevertheless, free models of this theory over finite relational structures are always finite, as the following construction of free models proves: Given sets A and B and relations F and G, one first identifies elements within A and B according to the functionality axioms for f and g and axiom 3. For each element y ∈ B on which g is not defined, we then adjoin a new element to A and extend g accordingly to a total function by axiom 2. Similarly, we then adjoin for each x ∈ A on which f is not defined a new element to B and extend f accordingly to a total function by axiom 1. Now every element in B on which g is not defined is of the form f(x) for some unique x ∈ A, so by axiom 3, we may extend g to a total function by setting g(f(x)) = x. Now f and g are total functions and g ∘ f is the identity function on A.
On first thought, we might thus hope RHL evaluation for this theory to eventually reach a fixed point and terminate.
Unfortunately, this is not the case for the RHL evaluation algorithm described in Section
2.2.
Consider the iterations of evaluation with initial relational structure given by A = {a₀}, B = ∅, and f and g entirely undefined:
- Axiom 1 matches on a₀, resulting in a new element b₀ and the tuple F(a₀, b₀).
- Axiom 2 matches on b₀, and axiom 3 matches on F(a₀, b₀). The former results in a new element a₁ and the tuple G(b₀, a₁), while the latter results in the tuple G(b₀, a₀).
- Axiom 1 matches on a₁, and the implicit functionality axiom for g matches on G(b₀, a₁) and G(b₀, a₀). The former results in a new element b₁ and the tuple F(a₁, b₁), while the latter results in the equality a₀ ≡ a₁.
- Axiom 2 matches on b₁, and the implicit functionality axiom for f matches on F(a₀, b₀) and F(a₀, b₁). The former results in a new element a₂ and the tuple G(b₁, a₂), while the latter results in the equality b₀ ≡ b₁.
- All further iterations alternate between variations of iterations 3 and 4.
What prevents termination for this theory is thus that the evaluation algorithm matches all sequents at once:
Observe that our proof that free models are finite relies on applying sequents in some particular order.
For this particular theory, non-termination can be prevented by carefully stating axioms so as to avoid alternating states, for example by replacing axioms 1 and 3 with the single axiom x↓ ⇒ g(f(x)) ≡ x.
However, the following variant of the RHL evaluation algorithm described in Section
2.2 terminates on a wide range of RHL theories, including the theory above.
One splits the top-level evaluation loop into an inner loop responsible for surjective sequents and an outer loop responsible for non-surjective sequents.
The algorithm thus alternates closing the relational structure under all surjective sequents (which always terminates) and a single step of matching and adjoining conclusions of non-surjective sequents.
If eventually a non-surjective step does not change the relational structure, then all sequents are satisfied and evaluation terminates.
However, I expect there to be theories where termination depends on a particular order in which non-surjective sequents are applied, and then this simple approach does not suffice.
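The loop structure just described might be sketched in the pseudo-code style of Section 2.1, reusing datalog, find_matches and apply_conclusion from above:

```
fn evaluate(structure, sequents) {
    let (surjective, non_surjective) = sequents.partition(is_surjective);
    loop {
        // Inner loop: close under surjective sequents; always terminates.
        datalog(structure, surjective);
        // One step of matching and applying non-surjective sequents.
        let has_changed = false;
        for sequent in non_surjective {
            for match in find_matches(structure, sequent.premise) {
                if apply_conclusion(structure, sequent.conclusion, match) {
                    has_changed = true;
                }
            }
        }
        // If the non-surjective step had no effect, all sequents hold.
        if !has_changed {
            break;
        }
    }
    return structure;
}
```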
3 Optimizations
In this section, we consider optimizations of the naive RHL algorithm that we discussed in Section
2.2.
Most of these techniques are adapted from optimizations that apply to Datalog evaluation, to the congruence closure problem or to both.
Implemented together, these optimizations allow us to recover the fast congruence closure algorithm due to [
3] as a special case of RHL evaluation for functionality axioms.
3.1 Semi-naive evaluation
Semi-naive evaluation is a common Datalog evaluation optimization.
It exploits the observation that matches of premises that were found in the previous iteration need not be considered again because conclusions to these matches have already been adjoined.
A match has not been found in a previous iteration if at least one of the atoms in the premise is matched with new data, i.e. data that was added in the last iteration.
To distinguish old data from new data, we store for each relation and each sort lists of tuples or elements that were added in the last iteration.
An n-fold nested loop that matches the premise of a sequent can now be replaced with n copies, where in the ith copy the ith loop iterates over new data only. For example, the nested loop described in Section 2.1 enumerating matches of the premise Le(u, v) ∧ Ge(w, v) ∧ Le(w, x) can be replaced by the following three nested loops:
for (u, v) in structure.rels_new[Le] {
    for (w, v1) in structure.rels_all[Ge] {
        if v1 != v { continue; }
        for (w1, x) in structure.rels_all[Le] {
            if w1 != w { continue; }
            matches.push([u, v, w, x]);
        }
    }
}
for (u, v) in structure.rels_all[Le] {
    for (w, v1) in structure.rels_new[Ge] {
        if v1 != v { continue; }
        for (w1, x) in structure.rels_all[Le] {
            if w1 != w { continue; }
            matches.push([u, v, w, x]);
        }
    }
}
for (u, v) in structure.rels_all[Le] {
    for (w, v1) in structure.rels_all[Ge] {
        if v1 != v { continue; }
        for (w1, x) in structure.rels_new[Le] {
            if w1 != w { continue; }
            matches.push([u, v, w, x]);
        }
    }
}
Observe that not only the conclusion application phase but also the normalization phase can lead to new data: If an element in a tuple of some relation changes as a result of normalization, then the tuple must be considered as a new tuple.
The optimized congruence closure algorithm described by [
3] also implements semi-naive evaluation.
Their
pending
list in algorithm 2.4 corresponds to our set of new tuples in the relation representing the graph of a function.
Semi-naive evaluation is well-suited for online applications, where one alternates RHL evaluation and ad-hoc manipulation.
If this manipulation consists only of adjoining data, then this data can be adjoined to the same data structures that hold new data during RHL evaluation.
The first iteration of subsequent RHL evaluation need then only consider matches for this new data instead of matches in the full relational structure.
3.2 Symmetries
Semi-naive matching of the premise of the functionality axiom for a (binary) function results in two loops:
for (x0, x1, x2) in structure.rels_new[f] {
    for (y0, y1, y2) in structure.rels_all[f] {
        ...
    }
}
for (x0, x1, x2) in structure.rels_all[f] {
    for (y0, y1, y2) in structure.rels_new[f] {
        ...
    }
}
On the other hand, the congruence closure algorithm described by [3] requires a single loop only. Indeed, the second loop is unnecessary due to a symmetry in the premise f(x0, x1, x2) ∧ f(y0, y1, y2) ∧ x0 ≡ y0 ∧ x1 ≡ y1 of the functionality axiom. The symmetry is given by swapping the xi and the yi. This results in a semantically equivalent premise, and swapping the variables has the same effect as swapping the two premise atoms.
In such cases, it suffices to consider matches where the first of the two atoms is interpreted with new data.
Another example where symmetries can be exploited is the anti-symmetry axiom Le(u, v) ∧ Le(v, u) ⇒ u ≡ v.
3.3 Indices and occurrence lists
Indices are meant to speed up the nested loops that enumerate matches of premises.
The idea is to replace each inner loop by an efficient sublinear lookup with fixed projections.
For example, matching the premise Le(u, v) ∧ Le(v, w) of a transitivity axiom can be sped up with an index on the first projection of Le. One thus maintains a map that allows fast lookup of all tuples (v, w) ∈ Le for fixed v. The premise can then be enumerated as follows:
for (u, v) in structure.rels[Le] {
    for (_, w) in structure.rels[Le].index[v] {
        matches.push((u, v, w));
    }
}
Indices are typically realized using variants of ordered search trees or hash maps.
They can be maintained over all iterations, or recreated before and disposed immediately after the premise matching phase in each iteration.
Recreating indices requires less memory compared to maintaining indices, since only indices needed to match a single sequent need to be stored in memory at any time.
Fast Datalog engines often maintain indices over all iterations, which results in faster execution at the expense of increased memory usage.
If indices are maintained over all iterations, new tuples must also be inserted into indices during the conclusion application phase.
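As a sketch, such an index on the first projection of a binary relation could be realized with a hash map; build_index, the u32 element identifiers, and the relation name Le are illustrative assumptions, not Eqlog's actual API:

```rust
use std::collections::HashMap;

// Index on the first projection of a binary relation:
// maps each element v to all tuples (v, w) of the relation.
fn build_index(rel: &[(u32, u32)]) -> HashMap<u32, Vec<(u32, u32)>> {
    let mut index: HashMap<u32, Vec<(u32, u32)>> = HashMap::new();
    for &(v, w) in rel {
        index.entry(v).or_default().push((v, w));
    }
    index
}

fn main() {
    let le = [(0, 1), (1, 2), (1, 3)];
    let index = build_index(&le);
    // Matching Le(u, v) ∧ Le(v, w): the inner loop becomes an index lookup.
    let mut matches = Vec::new();
    for &(u, v) in &le {
        for &(_, w) in index.get(&v).map(|t| t.as_slice()).unwrap_or(&[]) {
            matches.push((u, v, w));
        }
    }
    matches.sort();
    assert_eq!(matches, vec![(0, 1, 2), (0, 1, 3)]);
}
```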
For RHL evaluation, however, index maintenance is problematic due to the normalization phase, in which elements in relations are replaced by their canonical representatives if needed. If indices are maintained, they must be normalized, too.
We turn again to the fast congruence closure algorithm described by [
3, Section 2.4] to implement index normalization efficiently.
Their
signature table is a hash map index that speeds up matching premises of functionality axioms.
It is maintained throughout the iterations of the congruence closure algorithm.
Efficient normalization is implemented using further data structures which we shall refer to as
occurrence lists.
An occurrence list contains for each equivalence class the expression nodes which have at least one child in the equivalence class.
In the normalization step, it suffices to normalize those tuples that occur in occurrence lists of elements that ceased to be canonical representatives.
Occurrence lists can be adapted to RHL evaluation as follows.
We associate to each canonical representative a list of tuples in any relation in which the canonical representative occurs.
Tuples in occurrence lists need not be normalized.
In the conclusion application phase, we also insert tuples into the occurrence lists of each element of the inserted tuple. When merging two equivalence classes, we save the element that ceases to be a canonical representative along with its occurrence list for use during the following normalization phase. The occurrence list of the element that remains a canonical representative is set to the concatenation of the two original occurrence lists.
Concatenation can be implemented asymptotically efficiently if occurrence lists are realized as linked lists or rope data structures.
In the normalization phase, we remove each tuple in one of the occurrence lists we saved earlier, normalize the tuple and reinsert it into each index.
When enforcing an equality during the conclusion phase, the algorithm described in [3, Section 2.4] chooses the element that remains a canonical representative in a way that minimizes the amount of normalization necessary: the element with the longer occurrence list should remain canonical. This applies directly also to occurrence lists in RHL evaluation.
To avoid normalizing tuples that were inserted in the current iteration, the conclusion application phase can be split into an equality application phase, where we only consider equalities in conclusions, and a relation application phase, where we only consider relation atoms.
We then normalize between the equality application phase and the relation application phase.
This has the benefit that new tuples need not be normalized directly after insertion.
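The merge of two occurrence lists, with the element owning the longer list staying canonical, might be sketched as follows; merge_classes, the pair-tuple representation, and the use of plain Vec instead of a linked list or rope are illustrative assumptions:

```rust
use std::collections::HashMap;

// Merges two equivalence classes while maintaining occurrence lists.
// The element with the longer occurrence list stays canonical, so fewer
// tuples have to be re-normalized. Returns the canonical element and the
// tuples that must be normalized afterwards.
fn merge_classes(
    occurrences: &mut HashMap<u32, Vec<(u32, u32)>>,
    lhs: u32,
    rhs: u32,
) -> (u32, Vec<(u32, u32)>) {
    let lhs_len = occurrences.get(&lhs).map_or(0, |l| l.len());
    let rhs_len = occurrences.get(&rhs).map_or(0, |l| l.len());
    // Keep the element with more occurrences canonical.
    let (canonical, merged) = if lhs_len >= rhs_len { (lhs, rhs) } else { (rhs, lhs) };
    let to_normalize = occurrences.remove(&merged).unwrap_or_default();
    // The canonical element's list absorbs the merged element's occurrences.
    occurrences.entry(canonical).or_default().extend(to_normalize.iter().copied());
    (canonical, to_normalize)
}

fn main() {
    let mut occurrences: HashMap<u32, Vec<(u32, u32)>> = HashMap::new();
    occurrences.insert(0, vec![(0, 1), (2, 0)]);
    occurrences.insert(3, vec![(3, 3)]);
    // Enforce 0 ≡ 3: element 0 has the longer list, so it stays canonical.
    let (canonical, to_normalize) = merge_classes(&mut occurrences, 0, 3);
    assert_eq!(canonical, 0);
    assert_eq!(to_normalize, vec![(3, 3)]);
    assert_eq!(occurrences[&0].len(), 3);
}
```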
3.4 Functional projections
We say that the ith projection of a relation R in an RHL theory is functional if the ith component of each tuple in R is uniquely determined by the other components of the tuple. More generally, we can consider a set of projections which are uniquely determined by the complementary projections. As the name suggests, the functionality axiom for an n-ary function asserts that the last projection of the graph of the function is functional. Another example are injective functions, where the first n projections of the graph depend functionally on the last projection.
When indices are maintained on a relation with a functional projection, equality constraints can be generated already during insertion into the index instead of later during the matching phase.
For example, if r is the (n + 1)-ary relation representing the graph of an n-ary function, then an index on the first n arguments can be maintained to match the premise of the functionality axiom.
Without consideration of functionality of the last projection, we expect the index to allow lookups of lists of tuples for fixed values of the first n projections.
Due to the functional dependency, we can instead enforce that lookups result in at most one tuple.
Whenever a second tuple whose insertion would violate this property is about to be inserted into the index, we instead generate equality constraints according to the functional projections.
These equalities are then enforced during the next conclusion application phase.
The signature table hash map in the efficient congruence closure algorithm described in [3, Section 2.4] can be understood as an index with the functional projection optimization.
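This conflict-on-insert behavior can be sketched as follows. The class and field names below are ours, not Eqlog's, and the index is simplified to a plain hash map keyed by the determining arguments:

```python
class FunctionIndex:
    """Index on the graph of a function, with the functional
    projection optimization: the key (the first n components) uniquely
    determines the result (the last component)."""

    def __init__(self):
        self.index = {}               # args tuple -> result
        self.pending_equalities = []  # equalities to enforce later

    def insert(self, args, result):
        existing = self.index.get(args)
        if existing is None:
            self.index[args] = result
        elif existing != result:
            # Functional dependency would be violated: instead of
            # storing a second tuple, record the equality constraint,
            # to be enforced in the next conclusion application phase.
            self.pending_equalities.append((existing, result))
```

A lookup in `index` thus yields at most one tuple per key, and equality constraints are generated already during insertion rather than during matching.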
4 Applications
In this section, we discuss two applications of RHL and PHL evaluation to the implementation of programming languages: Steensgaard's points-to analysis (Section 4.1) and type inference (Section 4.2).
4.1 Steensgaard's points-to analysis
Points-to analysis aims to bound the set of heap objects a variable in a program can point to.
The two well-known approaches are due to Andersen and Steensgaard.
Both algorithms identify heap objects with their allocation sites, i.e. the program expressions that allocate objects.
The Andersen and Steensgaard analyses thus give an over-approximating answer to the question of whether a given variable v can point to an object that was allocated in expression e.
Both analyses must consider the case where a variable w is assigned to a variable v.
Andersen-style analysis computes for each variable v a set of objects that v can point to such that the following properties hold:
-
If there is a statement that assigns an allocating expression e to a variable v, then v can point to e.
-
If a variable v can take the value of a variable w (e.g. as a result of an assignment or a function call), and w can point to e, then v can point to e.
A minimal implementation of Andersen-style analysis using Datalog populates input relations, say new(v, e) and assign(v, w), with data derived from the program source code.
The rules above governing the pointsTo relation are then encoded in Datalog as follows:
-
pointsTo(v, e) :- new(v, e).
-
pointsTo(v, e) :- assign(v, w), pointsTo(w, e).
To summarize, Andersen's algorithm enforces subset constraints:
if a variable v can take the value of a variable w, either through a function call or a direct assignment, then the points-to set of w is a subset of the points-to set of v.
Steensgaard's algorithm is a less precise but typically faster variation of Andersen's algorithm.
It is based on equality constraints:
if a variable v can take the value of a variable w, then Steensgaard's algorithm equates the points-to sets of v and w.
A direct implementation of Steensgaard's algorithm maintains a union-find data structure on the variables of a program, and a mapping that assigns to each canonical representative a points-to set of heap allocations.
The algorithm scans the program source code and performs the following actions:
-
For each statement that assigns an allocating expression e to a variable v, add e to the points-to set of the canonical representative of v.
-
If a variable v can take the value of a variable w, unify the equivalence classes of v and w.
The points-to set of the unified equivalence class is the union of the points-to sets of the classes of v and w.
Steensgaard's algorithm is strictly less precise than Andersen's, but it typically requires less memory, since only one points-to set needs to be maintained for every equivalence class of variables.
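The direct implementation described above can be sketched as follows; this is a minimal Python sketch with invented names, using path halving in the union-find:

```python
class Steensgaard:
    """Union-find over variables, plus a points-to set of allocation
    sites for each canonical representative."""

    def __init__(self):
        self.parent = {}
        self.points_to = {}  # canonical representative -> set of sites

    def find(self, v):
        self.parent.setdefault(v, v)
        self.points_to.setdefault(v, set())
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def alloc(self, v, site):
        # Statement assigning allocating expression `site` to v.
        self.points_to[self.find(v)].add(site)

    def assign(self, v, w):
        # v can take the value of w: unify the classes of v and w.
        rv, rw = self.find(v), self.find(w)
        if rv != rw:
            self.parent[rw] = rv
            # The points-to set of the union is the union of the sets.
            self.points_to[rv] |= self.points_to.pop(rw)
```

Only canonical representatives carry points-to sets, which is exactly why the algorithm maintains one set per equivalence class rather than one per variable.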
To encode Steensgaard's algorithm in RHL, we can simply replace rule 2 above in Andersen's analysis with a rule whose conclusion is an equality of pointed-to objects, such as pointsTo(v, o) ∧ assign(v, w) ∧ pointsTo(w, o') ⇒ o = o', to enforce equality constraints.
4.2 Type inference
Type inference (or type reconstruction) is the task of assigning types to variables and expressions based on their usage in a program fragment.
The constraint-based typing algorithm for the simply typed lambda calculus described in [10, Sections 22.3 and 22.4] assigns to each term a separate, initially unrestricted type variable.
It then collects constraints on type variables according to the usage and definition of the corresponding terms.
This is accomplished by considering typing rules and their inverses.
For example, based on the application typing rule

    t1 : σ → τ        t2 : σ
    ─────────────────────────   (4)
          t1 t2 : τ

we infer the following constraints from a term t1 t2:
1.
If t1 has type σ → τ, then t1 t2 has type τ.
2.
Conversely, if t1 t2 has type τ, then t1 has type σ → τ for some type σ.
3.
If t1 has type σ → τ, then t2 has type σ.
4.
Conversely, if t2 has type σ, then t1 has type σ → τ for some type τ.
The implicit existentials in constraints 2 and 4 generate new type variables σ and τ, which must be chosen fresh for each instance of the constraint.
In the following unification step, the generated constraints are checked for compatibility, and if they are compatible, the most general substitution of type variables satisfying all the constraints is created.
For example, for the fragment f x with variables f and x, the algorithm outputs the substitution σ_f ↦ σ_x → τ for a fresh type variable τ, where σ_f and σ_x are the type variables initially assigned to f and x.
This inference procedure can be implemented in PHL as follows.
We introduce sorts Tm of terms and Ty of types.
Each n-ary type or term constructor corresponds to an n-ary PHL function.
For example, function types are encoded as a binary function symbol, say Fun : Ty × Ty → Ty, and application as a function App : Tm × Tm → Tm.
To enforce injectivity of type constructors, we introduce inverse functions for each parameter of a type constructor.
For function types, we add functions Dom : Ty → Ty and Cod : Ty → Ty and enforce axioms expressing that Dom and Cod are indeed inverse to Fun:

    Fun(σ, τ) = ρ ⇒ Dom(ρ) = σ ∧ Cod(ρ) = τ
    Dom(ρ) = σ ∧ Cod(ρ) = τ ⇒ Fun(σ, τ) = ρ

Thus Dom and Cod are defined on the same set of types: those of the form Fun(σ, τ) for types σ and τ.
To detect violations of joined injectivity of type constructors, we introduce a nullary predicate, say Conflict, and rules such as

    Fun(σ, τ) = List(ρ) ⇒ Conflict

for every pair of distinct type constructors, for example function types and list types.
We always arrange Conflict to be empty before PHL evaluation, so that Conflict is inhabited after evaluation if and only if a violation of joined injectivity was inferred.
We encode the typing relation as a function TypeOf : Tm → Ty instead of a relation because each term has a unique type.
The axiom asserting that TypeOf(t) is defined for every term t enforces that each term has a type.
During evaluation, this non-surjective rule introduces a fresh identifier as the type of each term if necessary.
Finally, we encode term constructors as PHL functions and add axioms according to inference rules and their inverses.
For example, the typing rule (4) of function application results in a PHL function App : Tm × Tm → Tm and the following PHL axioms governing it, corresponding to the constraints 1–4 above:
1.
App(t1, t2) = t ∧ TypeOf(t1) = Fun(σ, τ) ⇒ TypeOf(t) = τ
2.
App(t1, t2) = t ∧ TypeOf(t) = τ ⇒ Cod(TypeOf(t1)) = τ
3.
App(t1, t2) = t ∧ TypeOf(t1) = Fun(σ, τ) ⇒ TypeOf(t2) = σ
4.
App(t1, t2) = t ∧ TypeOf(t2) = σ ⇒ Dom(TypeOf(t1)) = σ
Observe that the conclusions of axioms 2 and 4 assert that domains and codomains of certain types exist, which together with our axioms for Dom and Cod implies that these types are function types.
If necessary, fresh type identifiers are adjoined during PHL evaluation by the non-surjective axioms asserting that the Dom and Cod functions are defined on the same set of types.
With the PHL theory modeling the type system at hand, the full type inference algorithm can now be implemented in three steps:
-
Populate a relational structure based on the program fragment:
We adjoin a unique term identifier for each term in the program fragment and add entries in the relations representing graphs of term constructors.
-
Close the relational structure under the axioms described above.
-
If the nullary conflict predicate contains an element, output an error.
Otherwise, for each type identifier σ there exists at most one type constructor F and an entry with σ as last component in the relation corresponding to F.
Expand each type identifier recursively into a maximal syntax tree according to such entries.
Output the maximal syntax tree representing the type of each term in the input program fragment.
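For illustration, the unification that PHL evaluation performs via its built-in equality can be mimicked explicitly with a union-find over type identifiers. The following Python sketch uses an invented representation, not Eqlog's, handles only the application constraints, and omits constructor-clash and occurs checks for brevity:

```python
import itertools

# Type identifiers are integers; a type id is either unconstrained or
# constrained to be a function type via shape[rep] = ("Fun", dom, cod).
counter = itertools.count()
parent = {}
shape = {}

def fresh():
    """Adjoin a fresh type identifier (an unconstrained type variable)."""
    t = next(counter)
    parent[t] = t
    return t

def find(t):
    while parent[t] != t:
        parent[t] = parent[parent[t]]  # path halving
        t = parent[t]
    return t

def unify(a, b):
    """Equate two type identifiers, propagating function-type shapes."""
    a, b = find(a), find(b)
    if a == b:
        return
    sa, sb = shape.get(a), shape.get(b)
    parent[b] = a
    if sa and sb:
        # Both are function types: unify domains and codomains.
        # (A real checker would report a conflict for distinct
        # constructors here, via the nullary conflict predicate.)
        unify(sa[1], sb[1])
        unify(sa[2], sb[2])
    elif sb:
        shape[a] = sb

def type_of_app(fun_ty, arg_ty):
    """Constraint for t1 t2: TypeOf(t1) = Fun(TypeOf(t2), result)."""
    result = fresh()            # fresh codomain type variable
    f = fresh()
    shape[f] = ("Fun", arg_ty, result)
    unify(fun_ty, f)
    return result
```

Here `fresh` plays the role of the non-surjective axioms that adjoin new type identifiers, and `unify` plays the role of RHL's inferred equalities.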
5 Conclusion
Using Datalog to implement type checking and inference, as we sketched in Section 4.2, is not a new idea.
For example, the blog post [9] discusses an implementation of Rust's trait system using Datalog.
One of the issues raised there is that Rust's associated types require reasoning about equalities of types.
A similar issue would arise for Haskell's type families.
The solution proposed in the blog post is to combine Datalog with a normalization algorithm.
RHL's built-in equality might offer a declarative alternative to normalization that applies also in situations where no strongly normalizing rewrite system is available.
Typing rules are typically specified using the notation of natural deduction (e.g. the application typing rule (4)).
Apart from syntactic differences, the structure of such rules and their semantics correspond almost precisely to PHL.
Indeed, it is generally understood that many type systems can be encoded as essentially algebraic theories [6, Chapter 3.D], which have the same descriptive strength as epic PHL.
From this perspective, program fragments can be identified with elements of the initial model of the PHL theory axiomatizing the type system, i.e. the free model over the empty relational structure.
These conceptual connections and the example in Section 4.2 suggest that PHL and RHL evaluation have the potential to assume a role in the implementation of programming languages paralleling that of parser generators:
where parser generators produce parsers from grammars, RHL evaluation engines produce type checkers from type systems.
Bibliography
-
Eqlog.
-
The Choice Construct in the Soufflé Language. Programming Languages and Systems, pages 163–181, 2021.
-
Peter J. Downey, Ravi Sethi and Robert Endre Tarjan. Variations on the Common Subexpression Problem. Journal of the ACM, 27(4):758–771, 1980.
-
Fast Parallel Equivalence Relations in a Datalog Compiler. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2019.
-
Erik Palmgren and Steven J. Vickers. Partial Horn logic and cartesian categories. Annals of Pure and Applied Logic, 145(3):314–353, 2007.
-
Jiří Adámek and Jiří Rosický. Locally Presentable and Accessible Categories. 1994.
-
Martin E. Bidlingmaier. Algebraic Semantics of Datalog with Equality. 2023.
-
John Cocke and Jacob Theodore Schwartz. Programming Languages and Their Compilers. 1970.
-
Niko Matsakis. Lowering Rust traits to logic.
-
Benjamin C. Pierce. Types and Programming Languages. The MIT Press, 2002.
-
Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock and Pavel Panchekha. Egg: Fast and Extensible Equality Saturation. Proc. ACM Program. Lang., 5(POPL), 2021.