On that which counting compels

Enuntiatio

Propositio VIII (Theorema). Principium Caveae Columbarum. Let there be given n classes, n a positive integer, and let n+1 objects each be assigned to exactly one class by a fixed rule. Then some one class receives no fewer than two objects.

More generally, if m objects are distributed among n classes, some class receives at least ⌈m/n⌉ objects; the case m = n+1 yields ⌈(n+1)/n⌉ = 2. The principle is not an observation about particular arrangements but a necessity binding every arrangement whatsoever: where the objects outnumber the classes, coincidence is not permitted but compelled. Herein lies its force as a lower-bound instrument — it certifies that a collision must occur without exhibiting which arrangement produces it. From it follows immediately the Corollary that no function from a set of n+1 keys into n buckets can be injective; every such hash collides.

Expressio

from typing import Callable, Hashable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def find_collision(
    items: Sequence[T],
    key: Callable[[T], Hashable],
    n_classes: int,
) -> Optional[Tuple[int, int]]:
    """Return a forced colliding pair of indices, or None when none is forced.

    By the Pigeonhole Principle, if len(items) > n_classes then two items must
    share a class; this routine exhibits the first such witnessing pair (j, i)
    with j < i and key(items[j]) == key(items[i]). When len(items) <= n_classes
    no collision is compelled, so None is returned.

    >>> find_collision([12, 5, 23, 17, 8, 30, 11], lambda x: x % 6, 6)
    (1, 2)
    >>> find_collision([10, 20, 30], lambda x: x % 7, 7) is None
    True
    >>> find_collision(["a", "b", "a", "c"], lambda s: s, 3)
    (0, 2)
    """
    if n_classes < 1:
        raise ValueError("n_classes must be a positive integer")

    # No collision is COMPELLED when the count does not exceed the classes.
    if len(items) <= n_classes:
        return None

    seen: dict = {}
    for i, item in enumerate(items):
        k = key(item)
        if k in seen:
            j = seen[k]
            assert key(items[j]) == key(items[i]), "pair must share a class"
            assert j < i, "first index must precede the second"
            return (j, i)
        seen[k] = i

    # Pigeonhole guarantees this branch is unreachable when keys land in
    # n_classes buckets and len(items) > n_classes. Its only escape is a
    # key that exceeds the declared range — which violates the hypothesis.
    raise AssertionError(
        f"Pigeonhole violated: {len(items)} items into {n_classes} classes "
        "produced no collision; the key must have escaped range(n_classes)."
    )


if __name__ == "__main__":
    # n+1 keys into n classes: a collision is forced (here, 5 and 23 mod 6 == 5).
    nums = [12, 5, 23, 17, 8, 30, 11]
    pair = find_collision(nums, key=lambda x: x % 6, n_classes=6)
    assert pair is not None and nums[pair[0]] % 6 == nums[pair[1]] % 6
    assert pair == (1, 2)

    # count <= n_classes: nothing compelled, hence None.
    assert find_collision([10, 20, 30], key=lambda x: x % 7, n_classes=7) is None
    assert find_collision([], key=lambda x: x, n_classes=4) is None

    # Any 7 keys into 6 buckets must collide — the lower-bound use.
    assert find_collision(list(range(7)), key=lambda x: x % 6, n_classes=6) is not None

    import doctest

    doctest.testmod(verbose=False)
    print("ALL ASSERTIONS PASSED")

Demonstratio

We prove the canonical case m = n+1 by reductio ad absurdum, then extend to general m, then discharge the algorithm’s correctness against the claim.

I. The counting contradiction. Let the objects be o₁, …, o_{n+1} and the classes C₁, …, C_n, each object assigned to exactly one class. For each class C_t let c_t denote the number of objects it receives. Because every object lands in exactly one class, the counts partition the objects, and so

c₁ + c₂ + ⋯ + c_n = n + 1. (∗)

Suppose, for contradiction, that no class holds two or more objects; that is, c_t ≤ 1 for every t. Summing these n inequalities,

c₁ + c₂ + ⋯ + c_n ≤ 1 + 1 + ⋯ + 1 = n.

Combined with (∗) this yields n + 1 ≤ n, hence 1 ≤ 0 — a falsehood. The supposition is therefore untenable: some c_t ≥ 2. This is the whole mechanism; the principle reduces to the impossibility of n unit summands totalling n+1.

II. The general bound ⌈m/n⌉. Distribute m objects among n classes with counts c₁, …, c_n summing to m. Suppose every c_t ≤ ⌈m/n⌉ − 1. Since each c_t is an integer and ⌈m/n⌉ − 1 < m/n, summing gives m = Σ c_t ≤ n(⌈m/n⌉ − 1) < n·(m/n) = m, i.e. m < m, absurd. Hence some c_t ≥ ⌈m/n⌉. Setting m = n+1 recovers Part I, since ⌈(n+1)/n⌉ = 2.

III. The lower-bound corollary (no injective hash). Let h : K → B be any function from a key set K with |K| = n+1 into a bucket set B with |B| = n. Take the keys as objects and the buckets as classes, with each key k assigned to the class h(k). By Part I some bucket receives two distinct keys k ≠ k′ with h(k) = h(k′). Thus h is not injective. The conclusion holds for every such h with no appeal to its internal structure: it is a lower bound on the unavoidable behaviour of all hashes from n+1 keys into n buckets. Q.E.D.

IV. Correctness of find_collision. We verify the routine realises the claim exactly.

Case len(items) ≤ n_classes. The guard returns None. By the claim this is the regime in which no collision is compelled, and the routine asserts nothing stronger — it does not deny that a collision may happen, only that none is forced. Correct.

Case len(items) > n_classes. The loop maintains the invariant: after processing index i, the map seen contains, for each distinct key encountered among items[0..i], the least index at which that key first appeared. Establishment: before the loop seen is empty, vacuously satisfying the invariant. Maintenance: at step i with key k, if k ∉ seen the item’s key is new and we record (k ↦ i), preserving “least index”; if k ∈ seen we have found j = seen[k] < i with key(items[j]) = k = key(items[i]) — a genuine colliding pair, the first to be discovered in scan order — and we return it. The two inline assertions certify both the shared-class property and j < i at the moment of return.

Termination and totality. The loop runs at most len(items) iterations. Suppose it completed without returning. Then all keys key(items[0]), …, key(items[len−1]) are pairwise distinct, giving len(items) distinct keys. The hypothesis states these keys lie in n_classes classes (the range of the assignment), so by Part I, len(items) > n_classes forces two of them to coincide — contradicting their distinctness. Hence the loop must return before exhausting the items; the trailing AssertionError is unreachable under the stated hypothesis and fires only if a key escapes range(n_classes), i.e. if the precondition is violated. The routine therefore returns a valid forced pair in this case and runs in O(n) time and O(n) auxiliary space, n = len(items), each key inserted or looked up once. The implementation meets the falsifiable claim in both regimes. Q.E.D.

Scholium. The principle’s power is precisely its silence on construction: it asserts existence by counting alone, refusing to name the witness. The algorithm above breaks that silence operationally — it produces the witness — yet the guarantee that a witness exists rests on Part I, not on the search. One may exhaust every arrangement of n+1 letters into n boxes and never escape a repeat; the arithmetic c₁ + ⋯ + c_n = n+1 with each c_t ≤ 1 is simply forbidden. Hypotheses non fingo: I assert only what the sum compels.