On the optimal contained within the optimal
De Subsequentia Communi Maxima: Optimum intra Optimum
Enuntiatio
Propositio VII. Theorema. Let two finite sequences be given, and . Denote by the length of the longest common subsequence of the prefixes and . Then obeys the law
The optimum over the whole is assembled, without loss, from the optima over its prefixes — the optimal is contained within the optimal. There are exactly distinct subproblems, each labelled by a pair ; hence the recurrence, whether memoized from the top or filled from the bottom, computes in time and space . This bound is the burden of the demonstration that follows.
Expressio
def lcs(a, b):
"""Length of the Longest Common Subsequence of sequences a and b.
Fills the table L(i, j) = LCS length of a[:i] and b[:j] from the
bottom up, per Propositio VII. Returns L(m, n).
>>> lcs("ABCBDAB", "BDCAB")
4
>>> lcs("AGGTAB", "GXTXAYB")
4
>>> lcs("", "anything")
0
>>> lcs("abc", "abc")
3
>>> lcs("abc", "def")
0
>>> lcs("XMJYAUZ", "MZJAWXU") # classic instance: "MJAU"
4
"""
m, n = len(a), len(b)
# dp[i][j] holds L(i, j); row 0 and column 0 are the empty-prefix base case (all zeros).
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
if a[i - 1] == b[j - 1]:
dp[i][j] = dp[i - 1][j - 1] + 1 # extend the matched prefix
else:
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1]) # drop one trailing symbol
return dp[m][n]
# --- Inline demonstrations of correctness (each row is L(m, n) of the inputs) ---
assert lcs("", "") == 0
assert lcs("a", "a") == 1
assert lcs("a", "b") == 0
assert lcs("ABCBDAB", "BDCAB") == 4 # e.g. "BCAB" or "BDAB"
assert lcs("AGGTAB", "GXTXAYB") == 4 # "GTAB"
assert lcs("abcde", "ace") == 3 # "ace"
assert lcs("aaaa", "aa") == 2 # multiplicity is capped by the shorter run
if __name__ == "__main__":
import doctest
doctest.testmod()
Demonstratio
We prove two things: first that the recurrence is correct — that as computed equals the true LCS length of the prefixes; second that it costs .
Part I — Optimal substructure (the recurrence is exact). Fix and write , . Let be any longest common subsequence of and , so . We argue by cases on the last symbols.
Case 1: . We claim and is an LCS of and .
-
(Append, giving .) Take any common subsequence of and of length . Since , the string is a common subsequence of and (the appended symbol is matched to position in and position in , both strictly beyond every index used by ). Hence .
-
(Truncate, giving .) Consider the chosen optimum . Suppose first . Then no embedding of into uses index , nor into uses index (a symbol equal to could only sit at the tail). So embeds into and , whence , and we are already done. Otherwise . Then embeds into and (strip the matched tail), so , i.e. .
Combining the two inequalities, .
Case 2: . No common subsequence can match to , since they differ. Thus any common subsequence of fails to use index of , or fails to use index of (it cannot pair them, and a subsequence symbol equal to neither tail forces no tail match — but more directly: ‘s final symbol, wherever embedded, cannot be simultaneously and ). Hence every such is a common subsequence either of or of , giving . Conversely both and are sub-instances whose common subsequences remain common to , so . Equality follows.
Base case. If or one prefix is empty, the only common subsequence is the empty one, and .
By strong induction on — every right-hand side , , has strictly smaller index sum and is correct by hypothesis — the table value equals the true for every . In particular is the LCS length of and . The induction is well-founded because the dependency relation strictly decreases and is bounded below by ; there are no cycles, so memoized top-down recursion and bottom-up fill compute identical values.
Part II — Count of subproblems and complexity. Each subproblem is uniquely named by a pair with and . The set of such pairs has cardinality exactly ; these are all the distinct subproblems the recurrence can summon, and they overlap heavily — , say, is demanded by , by and by alike. A naive recursion without a table recomputes such shared values exponentially often; the table collapses this to one evaluation apiece.
Define a potential = number of table cells not yet finalized, initially . The bottom-up loop visits each of the interior cells once (the border cells are set in at allocation). Each interior visit performs work — one comparison and either an increment or a two-way maximum, every operand being an already-finalized neighbour with smaller index sum — and decrements by one. Total work is therefore
with space for the table. (A standard refinement keeps only the previous row, reducing space to while preserving the bound, since each depends only on row and row .) The empirical check — all doctests pass and randomised instances over a -letter alphabet agree exactly with an exhaustive brute-force oracle — corroborates the derivation but does not constitute it; hypotheses non fingo, the bound rests on the counting argument above, not on the trials.
Q.E.D.