## Home

November 23, 2019

# Emulatibility Proof for Teradata's OVERLAPS on Google BigQuery

In a cinematic setting sprinkled with clever parallels to life’s overarching motifs, the great Jennifer Aniston once declared that the ability to express onself is not dependent upon the possession of 37 pieces of flair.

Alas, were the profundity of such wisdom to extend to the realm of databases, where the very currency of expression on any given platform is comprised of both the quality and quantity of available functional flair.

Let us focus on the presence of the `OVERLAPS` function in Teradata, which accepts two period values and determines whether or not the periods overlap. Google’s BigQuery defines no such function, and so we must seek an alternate means of expressing ourselves. Fortunately, BigQuery defines the functions `GREATEST` and `LEAST`, which we can use to emulate `OVERLAPS` despite its absence.

In this post, I walk through a formal proof of the correctness of such an emulation using Coq. You can find the proof in its entirety in the `teradata-overlaps-proof` repository on GitHub.

# Teradata’s `OVERLAPS`

Multiple overloads of Teradata’s `OVERLAPS` function exist; here we will limit ourselves to just those variants that accept a period expression. According to Teradata’s documentation, if `p1` and `p2` are valid periods they can destructured into their constituent parts:

``````S1 = BEGIN(p1)
E1 = END(p1)
S2 = BEGIN(p2)
E2 = END(p2)
``````

such that the expression `(S1, E1) OVERLAPS (S2, E2)` expands to the expression:

``````(S1 > S2 AND NOT (S1 >= E2 AND E1 >= E2))
OR
(S2 > S1 AND NOT (S2 >= E1 AND E2 >= E1))
OR
(S1 = S2 AND (E1 = E2 OR E1 <> E2))
``````

# BigQuery’s `GREATEST` and `LEAST`

In BigQuery you will find no data type equivalent to Teradata’s period type, nor will you find a function equivalent to `OVERLAPS`. However, armed with the function’s Boolean definition above, and inspired by an informal proof written by Bart Wiegmans, we can emulate `OVERLAPS` on BigQuery as:

``````GREATEST(S1, S2) < LEAST(E1, E2)
``````

We’ll aim to prove that, for any two period expressions, this translated definition evaluates to the same result as that yielded by the evaluation of the original Teradata definition. By doing so, we will thereby show that the two definitions are equivalent.

# Initial Proof Structures and Definitions

Each period value consists of two datetime values, namely the starting and ending datetimes defining the period’s boundaries, with the end being strictly greater than the start. Every unique datetime can be mapped to a unique natural number; therefore, our proof will operate over the naturals. We’ll start with the following preamble:

``````Require Import Omega.

Local Open Scope nat_scope.
``````

The period type is defined using a self-explanatory `Inductive` definition, with supporting functions `pstart` and `pend` defined to extract the start and end boundaries, respectively, from any given period value:

``````Inductive period : Type :=
| Period (a b : nat).

=>finition pstart (p : period) : nat :=
match p with | Period a b => a end.

Definition pend (p : period) : nat :=
match p with | Period a b => b end.
``````

Next, we need a relation that expresses what it means to be a valid period. A few sentences ago, I stated that a period’s end boundary must be strictly greater than its start, and so our relation looks like:

``````Inductive periodR : period -> Prop :=
| PeriodR : forall p,
pstart p < pend p ->
periodR p.
``````

If we had any other constraints on what it means to be a period, they would go in this `periodR` relation.

We need two more definitions before we can commence with the proving: the definition of `OVERLAPS` as Teradata defines it, along with our emulated definition expressed in terms of BigQuery functions. Both are direct extensions of the informal definitions stated above. For Teradata, we have:

``````Inductive overlapsTeradataR : period -> period -> Prop :=
| TeradataSGt : forall p1 p2,
periodR p1 ->
periodR p2 ->
(pstart p1) > (pstart p2) ->
~ (pstart p1 >= pend p2 /\ pend p1 >= pend p2) ->
| TeradataSLt : forall p1 p2,
periodR p1 ->
periodR p2 ->
(pstart p2) > (pstart p1) ->
~ (pstart p2 >= pend p1 /\ pend p2 >= pend p1) ->
| TeradataSEq : forall p1 p2,
periodR p1 ->
periodR p2 ->
pstart p1 = pstart p2 ->
(pend p1 = pend p2 \/ pend p1 <> pend p2) ->
``````

There are three inductive cases, one for each disjunctive in the definition from Teradata’s documentation. The only difference here is that we explicitly require `p1` and `p2` to be valid periods for the relation to hold in any of the three cases; this is an example of an obvious assumption that’s safe to elide informally, but (typically) must be present in a formal proof setting. (Exercise for the reader: try removing just these `periodR` assumptions in this definition and see if the proof below still holds.)

Now that we have a formal definition for Teradata’s `OVERLAPS`, let’s define our emulated definition for BigQuery:

``````Inductive overlapsBigQueryR : period -> period -> Prop :=
| BigQuery : forall p1 p2,
periodR p1 ->
periodR p2 ->
max (pstart p1) (pstart p2) < min (pend p1) (pend p2) ->
(overlapsBigQueryR p1 p2).
``````

Consistent with our informal definition above, we have just one inductive case, an expression in terms that BigQuery will understand, with Coq’s `max` standing in for BigQuery’s `GREATEST` and `min` for `LEAST`. A look at the definitions for each pair of those functions should convince you that they are equivalent – but no need to take my word for it! Aside: Coq’s `min` and `max` are binary, while BigQuery’s `GREATEST` and `LEAST` are n-ary, a distinction which doesn’t impact this particular proof but is worth pointing out.

# The Proof

Now the fun begins. In this section, I’ll walk through the proof, adding what I hope are helpful explanations for others who are learning or embarking on proofs of a similar nature. One of the tricky things about Coq proofs is that you can’t really read them for understanding, but must fire up `coqide` or Proof General and step through them yourself. In the excerpts below, I include the state of the goal stack corresponding to various states of the proof, but there really is no substitution for walking through it oneself.

We open the proof by stating that we aim to prove that our definitions of `OVERLAPS` for both Teradata and BigQuery are equivalent (`<->`) for all period values; more precisely, for any two periods `p1` and `p2`, if our Teradata relation holds, so should our BigQuery relation, and vice versa:

``````Theorem overlaps_Teradata_BigQuery_equiv : forall p1 p2 : period,
overlapsTeradataR p1 p2 <-> overlapsBigQueryR p1 p2.
Proof.

(* === GOAL STACK ============================================= *)

1 subgoal

forall p1 p2 : period, overlapsTeradataR p1 p2 <-> overlapsBigQueryR p1 p2
``````

Equivalence is just bidirectional implication; therefore we need two goals, one for each direction of implication, and we obtain these via `split`:

``````Proof.
split.

(* === GOAL STACK ============================================= *)

2 subgoals

- p1, p2 : period

overlapsTeradataR p1 p2 -> overlapsBigQueryR p1 p2

subgoal 2 is:
overlapsBigQueryR p1 p2 -> overlapsTeradataR p1 p2
``````

As we can see from the goal stack, we are first asked to show that if two periods overlap according to our Teradata definition, they also overlap according to our BigQuery definition. We’ll use a bullet point `*` to focus on this first subgoal. Then we want to introduce the former as a hypothesis (using `intros H`), and next “open it up” so as to make use of the assumptions inherent to that hypothesis. This latter step is accomplished by inverting the introduced hypothesis using `inversion H`.

``````Proof.
split.
* intros H. inversion H; subst.

(* === GOAL STACK ============================================= *)

3 subgoals

- p1, p2 : period
- H : overlapsTeradataR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 > pstart p2
- H3 : ~ (pstart p1 >= pend p2 /\ pend p1 >= pend p2)

overlapsBigQueryR p1 p2

subgoal 2 is:
overlapsBigQueryR p1 p2
subgoal 3 is:
overlapsBigQueryR p1 p2
``````

(Aside: why `inversion H` instead of `induction H`? Simply because there’s no need for induction here: none of the Teradata relation constructors have an assumption that would generate an induction hypothesis. Try switching out `inversion H` with `induction H` here, and you’ll find that both tactics yield the exact same subgoals and contexts for this proof.)

We’re focused on the first direction of implication (Teradata to BigQuery), so Coq shows us three subgoals generated by our inversion of `H`, one for each inductive case of the Teradata relation. In other words, Coq is saying that for any possible way in which the Teradata relation could hold for two period values, the BigQuery relation must be shown to hold as well.

The first subgoal assumes that the Teradata relation holds by way of the `TeradataSGt` constructor. Before we go on to prove this particular subgoal, let’s dispense with some mechanics that we know will be useful in each of the three cases; by doing so, we won’t have to repeat ourselves for each subgoal. Two salient observations come to mind:

• Look at `H0` and `H1`. Both are useful assumptions to have, in that they allow us to assume that the periods `p1` and `p2` are both valid periods. But written in this way, they are rather opaque; a valid period’s start is strictly less than its end, and it’s that comparative relationship we’ll want to make use of when proving each subgoal, not this opaque statement. We’ll join `inversion H0; inversion H1; subst` to our proof so far using the semicolon operator to do this for all subgoals.
• Each of the three subgoals is simply `overlapsBigQueryR p1 p2`. Given that there’s only one constructor for the BigQuery relation, let’s apply that constructor to all three of our subgoals, also by using the semicolon operator.
``````Proof.
split.
* intros H. inversion H; subst;
inversion H0; inversion H1; subst; apply BigQuery.

(* === GOAL STACK ============================================= *)

9 subgoals

- p1, p2 : period
- H : overlapsTeradataR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 > pstart p2
- H3 : ~ (pstart p1 >= pend p2 /\ pend p1 >= pend p2)
- H4 : pstart p1 < pend p1
- H6 : pstart p2 < pend p2

periodR p1

subgoal 2 is:
periodR p2
subgoal 3 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 4 is:
periodR p1
subgoal 5 is:
periodR p2
subgoal 6 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 7 is:
periodR p1
subgoal 8 is:
periodR p2
subgoal 9 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
``````

One downside to using the semicolon operator is that “natural” subgoal grouping and focusing may no longer be possible. Above, we had three subgoals corresponding to each of the three inductive variants in our Teradata definition; now we have nine subgoals, one for each of those variants’ assumptions. But I think the benefits outweight the loss of legibility here, especially in light of the next observation: six of those subgoals are immediately evident from our assumptions that `p1` and `p2` are in fact valid periods, and thus we can dispatch them using `assumption`:

``````Proof.
split.
* intros H. inversion H; subst;
inversion H0; inversion H1; subst; apply BigQuery; try assumption.

(* === GOAL STACK ============================================= *)
3 subgoals

- p1, p2 : period
- H : overlapsTeradataR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 > pstart p2
- H3 : ~ (pstart p1 >= pend p2 /\ pend p1 >= pend p2)
- H4 : pstart p1 < pend p1
- H6 : pstart p2 < pend p2

Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)

subgoal 2 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 3 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
``````

Now we’re getting to the core of the proof for the Teradata to BigQuery direction of implication. Yet we can clean up a bit more before proceeding with the first subgoal. Notice that `H3` is always a conjunctive, and sometimes the negation of a conjunctive. If we don’t distribute the negation now, we’ll need to do it for each case individually. Furthermore, given that `H3` is always a conjunctive, let’s break it up now; doing so will duplicate each of our three current subgoals, one for each of the two conjunctive branches:

``````Proof.
split.
* intros H. inversion H; subst;
inversion H0; inversion H1; subst; apply BigQuery; try assumption;
try (apply Decidable.not_and in H3; try apply dec_ge).

(* === GOAL STACK ============================================= *)
6 subgoals

- p1, p2 : period
- H : overlapsTeradataR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 > pstart p2
- H3 : ~ pstart p1 >= pend p2
- H4 : pstart p1 < pend p1
- H6 : pstart p2 < pend p2

Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)

subgoal 2 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 3 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 4 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 5 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
subgoal 6 is:
Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)
``````

As expected, we have our six subgoals (we would have had more of the form `Decidable.decidable (pstart ? >= pend ?)` had we not dispatched these by simply applying `dec_ge` where applicable, as such comparative `>=` expressions are obviously decidable). Proving each of them is a more-or-less routine matter of rewriting the goal to match our assumptions in each case. Stepping through each would be rather tedious of me, so I’ll simply present all six at once and highlight the finer points:

``````Proof.
split.
- intros H. inversion H; subst;
inversion H0; inversion H1; subst; apply BigQuery; try assumption;
try (apply Decidable.not_and in H3; try apply dec_ge);
destruct H3 as [|].
* rewrite max_l.
+ apply not_le in H3. apply Nat.min_glb_lt; assumption.
+ apply not_ge in H3. rewrite Nat.lt_eq_cases. left. assumption.
* rewrite min_l; apply not_ge in H3.
+ rewrite max_l. assumption.
rewrite Nat.lt_eq_cases. left. assumption.
+ rewrite Nat.lt_eq_cases. left. assumption.
* rewrite max_r.
+ apply not_ge in H3. apply Nat.min_glb_lt; assumption.
+ rewrite Nat.lt_eq_cases. left. assumption.
* rewrite max_r.
+ rewrite min_r. assumption.
rewrite Nat.lt_eq_cases. apply not_ge in H3. left. assumption.
+ rewrite Nat.lt_eq_cases. left. assumption.
* rewrite max_r.
+ rewrite min_r. assumption.
rewrite Nat.lt_eq_cases. right. symmetry. assumption.
+ rewrite Nat.lt_eq_cases. right. assumption.
* rewrite max_r.
+ rewrite H2 in H4. apply Nat.min_glb_lt; assumption.
+ rewrite Nat.lt_eq_cases. right. assumption.

(* === GOAL STACK ============================================= *)
1 subgoal

subgoal 1 is:
overlapsBigQueryR p1 p2 -> overlapsTeradataR p1 p2
``````

There are some patterns here that aren’t quite so easily dispatched with `try` or `repeat`; I’m sure this could be dealt with quite nicely using Ltac, but seeing as this is a single proof rather than a whole library of them, I think Ltac would be overkill here. Most of this is straightforward, but two points do merit further discussion here:

• `Nat.min_glb_lt`: Anytime we have a goal of the form `pstart pX < Init.Nat.min (pend pY) (pend pZ)`, we need to split this up so we can make use of the Boolean expressions comprising the Teradata definition assumptions. Applying `Nat.min_glb_lt` gives us two goals, one for `pstart pX < pend pY` and another for `pstart pX < pend pZ`, which are then trivial to dispatch via assumptions.
• `Nat.lt_eq_cases`: Whenever we have a comparison that isn’t strictly greater or less than, we’d like to split that up so we have a choice about which assumption the comparison actually relies upon. Given a goal of form `pstart pX <= pstart pY`, applying `Nat.lt_eq_cases` gives us `pstart pX < pstart pY \/ pstart pX = pstart pY`; from there we can choose `left` or `right` depending upon what’s available to us in the context.

As we can see from the goal stack, we are finished with the Teradata to BigQuery implication direction. Now we must prove the converse, and it’s this direction that I find to be the more interesting of the two.

``````Proof.
split.
- (* ... elided ... *)
- intros H. inversion H. subst.

(* === GOAL STACK ============================================= *)
1 subgoal

- p1, p2 : period
- H : overlapsBigQueryR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : Init.Nat.max (pstart p1) (pstart p2) < Init.Nat.min (pend p1) (pend p2)

``````

As before, we invert `H` to obtain a more useful goal context. Unlike before, we don’t need to invert `H0` and `H1`. Those assumptions are implied by `H2`, along with two other important assumptions that we need: `(pstart p2) < (pend p1)` and `(pstart p1) < (pend p2)`. They are within reach using `Nat.min_glb_lt_iff`:

``````Proof.
split.
- (* ... elided ... *)
- intros H. inversion H. subst.
apply Nat.max_lub_lt_iff in H2.
destruct H2 as [H3 H4].
apply Nat.min_glb_lt_iff in H3.
apply Nat.min_glb_lt_iff in H4.
destruct H3. destruct H4.

(* === GOAL STACK ============================================= *)
1 subgoal (ID 475)

- p1, p2 : period
- H : overlapsBigQueryR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 < pend p1
- H3 : pstart p1 < pend p2
- H4 : pstart p2 < pend p1
- H5 : pstart p2 < pend p2

``````

We’ve reached the trickiest part in the proof, at least in my opinion. I stared at this goal stack for a long time, not knowing quite how to proceed. None of the assumptions here allow us to directly apply one of the constructors of `overlapsTeradataR` – they just don’t convey enough information for us to do so profitably.

To make this abundantly clear, note how each of the three constructors maps to a different relationship between the start boundaries of `p1` and `p2`:

• greater than: `(pstart p1) > (pstart p2)`
• less than: `(pstart p2) > (pstart p1)`
• equal: `pstart p1 = pstart p2`

But all of the assumptions in the context only relate the start of one period to the end of either itself or the other, never to the start. So how can we make any progress?

I eventually realized I would need a supporting lemma, and I found it in the Coq standard libary by the name of `Nat.max_spec`:

``````Lemma max_spec n m :
(n < m /\ max n m == m) \/ (m <= n /\ max n m == n).
``````

This is perfect for our needs, because it covers all three relationships required by the Teradata relation: on the left we have strictly less than, and on the right we have strictly greater than and equality (recall that we’ll be able to break `<=` apart using `Nat.lt_eq_cases`). By asserting this into our context and subsequently splitting up the disjunctive and conjunctives within, we get two subgoals:

``````Proof.
split.
- (* ... elided ... *)
- intros H. inversion H. subst.
apply Nat.max_lub_lt_iff in H2.
destruct H2 as [H3 H4].
apply Nat.min_glb_lt_iff in H3.
apply Nat.min_glb_lt_iff in H4.
destruct H3. destruct H4.
assert (HMQ:=Nat.max_spec (pstart p1) (pstart p2)).
destruct HMQ as [|]; destruct H6 as [].

(* === GOAL STACK ============================================= *)
2 subgoals

- p1, p2 : period
- H : overlapsBigQueryR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 < pend p1
- H3 : pstart p1 < pend p2
- H4 : pstart p2 < pend p1
- H5 : pstart p2 < pend p2
- H6 : pstart p1 < pstart p2
- H7 : Nat.max (pstart p1) (pstart p2) = pstart p2

subgoal 2 (ID 495) is:
``````

This is exactly what we want to see. The hypothesis `H6` is now a clear signal that we are ready to apply the `TeradataSLt` constructor:

``````Proof.
split.
- (* ... elided ... *)
- intros H. inversion H. subst.
apply Nat.max_lub_lt_iff in H2.
destruct H2 as [H3 H4].
apply Nat.min_glb_lt_iff in H3.
apply Nat.min_glb_lt_iff in H4.
destruct H3. destruct H4.
assert (HMQ:=Nat.max_spec (pstart p1) (pstart p2)).
destruct HMQ as [|]; destruct H6 as [].
apply PeriodR. assumption. assumption. assumption.
unfold not. intros. destruct H8 as [].

(* === GOAL STACK ============================================= *)
1 subgoal

- p1, p2 : period
- H : overlapsBigQueryR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 < pend p1
- H3 : pstart p1 < pend p2
- H4 : pstart p2 < pend p1
- H5 : pstart p2 < pend p2
- H6 : pstart p1 < pstart p2
- H7 : Nat.max (pstart p1) (pstart p2) = pstart p2
- H8 : pstart p2 >= pend p1
- H9 : pend p2 >= pend p1

False
``````

What do we do here? Whenever I get a `False` goal to prove, I’ve learned to remember that one must be able to show that somewhere in the context lies a contradiction. Here the contradiction lies between `H4` and `H8`: the former states that `p1`‘s ending is strictly greater than `p2`’s start, while the latter states that `p1`’s ending is less than or equal to `p2`’s start. This is obviously a contradiction which introduces `False` into our context, and we can use this `False` to prove our goal of `False` by simply invoking `intuition`. This process is always a mind-bender for me, but it makes sense if you carefully step through it with Coq and observe how your negated goal is preserved, it’s just that the negation itself remains while your non-negated goal statement is elevated to the context so that you can prove the contradiction there.

After invoking `intuition`, the rest of the proof proceeds as expected up until just before the very end:

``````Proof.
split.
- (* ... elided ... *)
- intros H. inversion H. subst.
apply Nat.max_lub_lt_iff in H2.
destruct H2 as [H3 H4].
apply Nat.min_glb_lt_iff in H3.
apply Nat.min_glb_lt_iff in H4.
destruct H3. destruct H4.
assert (HMQ:=Nat.max_spec (pstart p1) (pstart p2)).
destruct HMQ as [|]; destruct H6 as [].
apply PeriodR. assumption. assumption. assumption.
unfold not. intros. destruct H8 as []. intuition.
* rewrite Nat.lt_eq_cases in H6. destruct H6 as [|].
apply PeriodR. assumption. assumption. assumption.
unfold not. intros. destruct H8 as []. intuition.
apply PeriodR. assumption. assumption. auto.

(* === GOAL STACK ============================================= *)
1 subgoal (ID 779)

- p1, p2 : period
- H : overlapsBigQueryR p1 p2
- H0 : periodR p1
- H1 : periodR p2
- H2 : pstart p1 < pend p1
- H3 : pstart p1 < pend p2
- H4 : pstart p2 < pend p1
- H5 : pstart p2 < pend p2
- H6 : pstart p2 = pstart p1
- H7 : Nat.max (pstart p1) (pstart p2) = pstart p1

pend p1 = pend p2 \/ pend p1 <> pend p2
``````

On the brink of glory, this final goal stands in our way. I must admit this also stymied me for some time. This is one of those statements I take for granted as being true, but of course Coq requires proof that it is in fact true in this particular situation. Thankfully, I’ve defined periods over the natural numbers, for which equality is decidable, rather than i.e. tri-state logic, which would make for an interesting proof in and of itself. But for now I think I’ll stick with the naturals, so with an incantation of `decide equality`, our proof is complete.

``````Proof.
split.
- (* ... elided ... *)
- intros H. inversion H. subst.
apply Nat.max_lub_lt_iff in H2.
destruct H2 as [H3 H4].
apply Nat.min_glb_lt_iff in H3.
apply Nat.min_glb_lt_iff in H4.
destruct H3. destruct H4.
assert (HMQ:=Nat.max_spec (pstart p1) (pstart p2)).
destruct HMQ as [|]; destruct H6 as [].
apply PeriodR. assumption. assumption. assumption.
unfold not. intros. destruct H8 as []. intuition.
* rewrite Nat.lt_eq_cases in H6. destruct H6 as [|].
apply PeriodR. assumption. assumption. assumption.
unfold not. intros. destruct H8 as []. intuition.
apply PeriodR. assumption. assumption. auto.
decide equality.
Qed.

(* === GOAL STACK ============================================= *)