tva
← Insights

Tablas de Cierre para el Cálculo de Relaciones

Family relationship queries are deceptively hard. “Are these two people related?” “What is our degree of separation?” “Who is the closest common ancestor?” These questions feel natural to a person but are expensive to answer with standard adjacency list storage, where each row records only a direct parent-child pair. Answering them requires traversing the tree recursively, which in SQL means recursive CTEs — correct, but slow on deep trees and difficult to compose.

Closure tables solve this by pre-computing and storing the full reachability of the tree. Every ancestor-descendant pair is recorded explicitly, with depth, at write time. The result is dramatically faster reads at the cost of more complex writes. For a family tree application where reads are far more frequent than writes, this trade-off is almost always correct.


The Adjacency List Problem

The standard way to store a tree in a relational database is an adjacency list: a table where each row has an id and a parent_id. It is simple to understand and simple to write. Inserting a new person requires one row. Updating a parent relationship requires one update.

But answering “who are all the ancestors of person X?” requires walking up the tree recursively. In PostgreSQL and most modern SQL databases this is expressed with a recursive CTE:

WITH RECURSIVE ancestors AS (
  SELECT id, parent_id, 0 AS depth
  FROM people
  WHERE id = :person_id

  UNION ALL

  SELECT p.id, p.parent_id, a.depth + 1
  FROM people p
  JOIN ancestors a ON p.id = a.parent_id
)
SELECT * FROM ancestors;

This works. But on a deep tree, the recursive CTE scans rows at each level and the cost grows with tree depth. For a shallow tree — three or four generations — it is fast enough. For a tree spanning ten or fifteen generations with hundreds of nodes, the performance degrades noticeably. Worse, composing two recursive CTEs for a query like “find the common ancestor of X and Y” requires running both recursions and then joining on the results, which compounds the cost.

The Closure Table Structure

A closure table adds a second table alongside the primary people table. Each row in the closure table represents a reachability relationship: a pair of (ancestor_id, descendant_id), plus the depth (number of edges between them). Every person is also their own ancestor at depth zero.

CREATE TABLE family_closure (
  ancestor_id   INTEGER NOT NULL REFERENCES people(id),
  descendant_id INTEGER NOT NULL REFERENCES people(id),
  depth         INTEGER NOT NULL,
  PRIMARY KEY (ancestor_id, descendant_id)
);

CREATE INDEX idx_closure_descendant ON family_closure (descendant_id);
CREATE INDEX idx_closure_ancestor   ON family_closure (ancestor_id);

When person X exists without a parent, one row is inserted: (X, X, 0). When person X is added as a child of person P, the insertion expands to include all of P’s ancestors:

-- Self-reference
INSERT INTO family_closure (ancestor_id, descendant_id, depth)
VALUES (:x_id, :x_id, 0);

-- All ancestors of parent P become ancestors of X, at depth + 1
INSERT INTO family_closure (ancestor_id, descendant_id, depth)
SELECT ancestor_id, :x_id, depth + 1
FROM family_closure
WHERE descendant_id = :parent_id;

This means inserting a person at generation depth N requires inserting N+1 rows into the closure table. For a typical family tree, where depth rarely exceeds 15 or 20 generations, this is entirely manageable. The write overhead is proportional to depth, not to the total size of the tree.

Querying All Ancestors

With the closure table in place, querying all ancestors of a person is a single join with no recursion:

SELECT p.*, fc.depth
FROM family_closure fc
JOIN people p ON p.id = fc.ancestor_id
WHERE fc.descendant_id = :person_id
  AND fc.depth > 0
ORDER BY fc.depth;

This query executes with a single index scan on descendant_id. The cost is proportional to the number of ancestors, not to the depth of the tree. For a tree of typical genealogical depth, this is an order of magnitude faster than the equivalent recursive CTE, and the difference grows as the tree deepens.

The symmetric query — all descendants of a person — is the same pattern using the ancestor_id index:

SELECT p.*, fc.depth
FROM family_closure fc
JOIN people p ON p.id = fc.descendant_id
WHERE fc.ancestor_id = :person_id
  AND fc.depth > 0
ORDER BY fc.depth;

Lowest Common Ancestor

The Lowest Common Ancestor (LCA) of two people is the most recent ancestor they share. In family tree terms, it is the answer to “who is the most recent common ancestor of these two individuals?” — the grandparent, or great-grandparent, or further back, depending on the relationship.

With a closure table, the LCA query is a join on the two ancestor sets, filtered to shared ancestors, ordered by depth to find the closest one:

SELECT a.ancestor_id, a.depth + b.depth AS total_distance
FROM family_closure a
JOIN family_closure b ON a.ancestor_id = b.ancestor_id
WHERE a.descendant_id = :person_a
  AND b.descendant_id = :person_b
ORDER BY total_distance ASC
LIMIT 1;

This returns the shared ancestor with the smallest combined distance from both people. On the two index scans on descendant_id, this runs in milliseconds even on moderately large trees. There is no recursion, no multiple passes through the tree, and no intermediate materialisation of ancestor sets beyond what the index provides.

The query also naturally handles cases where one person is an ancestor of the other. If person A is a direct ancestor of person B, the LCA is person A itself, and the query returns it correctly because A is in its own ancestor set at depth 0.

Degree of Separation

The degree of separation between two people is the number of family edges you need to traverse to get from one to the other via their shared ancestor. It is not the same as the depth difference between them — it is the sum of each person’s distance from their LCA.

SELECT MIN(a.depth + b.depth) AS degrees_of_separation
FROM family_closure a
JOIN family_closure b ON a.ancestor_id = b.ancestor_id
WHERE a.descendant_id = :person_a
  AND b.descendant_id = :person_b;

This is the same join as the LCA query with MIN() aggregate instead of ORDER BY ... LIMIT 1. First cousins have a degree of separation of 4 (both are two steps from a shared grandparent). First cousins once removed have a degree of 5. This query returns those values correctly without any application-layer traversal logic.

The relationship description — “first cousin twice removed”, “great-uncle” — requires additional logic that maps the depth values to relationship labels. A function that takes the depth from person A to LCA and the depth from person B to LCA and returns a relationship label covers most common cases. This is application-layer logic, not SQL, and it is considerably simpler when the depth values are available as clean integers from the query.

Maintaining the Closure Table on Updates

Insertions are straightforward, as shown above. Deletions and parent-relationship changes require more care.

Deleting a person from the tree means removing all rows in the closure table where they appear as either ancestor or descendant. If the person has descendants, those descendants are now disconnected from the person’s ancestor set. Whether this is the correct behaviour depends on the application — in a genealogy context, deleting a person who has recorded descendants is typically prevented at the application layer.

Changing a parent relationship — correcting a recorded parent to a different person — requires removing the old closure rows for the subtree rooted at the child and reinserting them relative to the new parent. The simplest implementation is a transaction that deletes all closure rows for the entire affected subtree and reinserts them by traversing from the new parent. This is correct but locks the relevant rows during the transaction. For most genealogy applications, which are write-light, this is acceptable.

Performance Comparison

The performance difference between closure tables and recursive CTEs is significant and grows with tree size. On a synthetic tree with 500 nodes and 12 average generations of depth, an LCA query using recursive CTEs runs in a range that makes it feel slightly sluggish in a UI context. The same query using the closure table join is fast enough that the latency is dominated by network round-trip time rather than query execution.

The write overhead of maintaining the closure table adds a measurable cost per insertion, proportional to tree depth. But in a family tree context, insertions are rare relative to reads — a family tree might be queried thousands of times to display relationships and visualisations for every single person added. The trade-off strongly favours the closure table approach.

The storage overhead is also worth noting. A closure table for a 500-person tree with average depth of 12 stores roughly 3,000 to 5,000 rows. At a few bytes per row, this is negligible. Even a very large genealogy database with tens of thousands of individuals produces a closure table that fits comfortably in memory on a modern database server, where it will be cached and available for fast index lookups.

Related Insights

Artículos relacionados