Efficiently Finding All Longest Common Subsequences of Two Strings with Dynamic Programming
Efficiently Finding All Longest Common Subsequences of Two Strings with Dynamic Programming
When dealing with text analysis or bioinformatics, finding all the longest common subsequences (LCS) of two strings is essential. This article delves into an efficient multi-step approach using dynamic programming (DP) to solve this problem. We will explain the method, provide a Python implementation, and discuss its time and space complexities.
Step 1: Computing the Length of the LCS Using Dynamic Programming
The first step involves using dynamic programming to calculate the lengths of all LCS for all pairs of prefixes of the two strings. This can be achieved using a 2D array, often referred to as a DP table.
Initialization
Start by creating a 2D array dp where dp[i][j] stores the length of the LCS for the first i characters of string X and the first j characters of string Y.
Filling the DP Table
If X[i-1] Y[j-1], then dp[i][j] dp[i-1][j-1] 1. Otherwise, dp[i][j] max(dp[i-1][j], dp[i][j-1]).The base case initializes all elements along the first row and first column of the DP table to 0.
Backtracking to Find All LCS
Once the DP table is filled, the LCS can be reconstructed by backtracking. The following recursive function can be used to backtrack to find all the LCS strings:
Recursive Backtracking
Use a helper function that takes the current i and j indices and the current LCS string being constructed. If you reach a base case where either index is 0, store the current LCS string. If the characters match X[i-1] Y[j-1], include that character in the current LCS and move diagonally in the DP table. If they don't match, move in the direction of the larger value: move left if dp[i-1][j] dp[i][j-1], move up if dp[i][j-1] > dp[i-1][j].Step 3: Avoiding Duplicates
To efficiently manage and avoid duplicates in the LCS results, use a set to store the LCS results. Whenever an LCS is found, add it to the set to ensure uniqueness.
Example Code
The following Python function combines these steps to find all LCSs:
def all_lcs(X, Y): m, n len(X), len(Y) dp [[0] * (n 1) for _ in range(m 1)] # Step 1: Fill the DP table for i in range(1, m 1): for j in range(1, n 1): if X[i - 1] Y[j - 1]: dp[i][j] dp[i - 1][j - 1] 1 else: dp[i][j] max(dp[i - 1][j], dp[i][j - 1]) # Step 2: Backtrack to find all LCS def backtrack(i, j): if i 0 or j 0: return set() if X[i - 1] Y[j - 1]: return {lcs X[i - 1] for lcs in backtrack(i - 1, j - 1)} return backtrack(i - 1, j).union(backtrack(i, j - 1)) return backtrack(m, n)
Example Usage
X "abcde" Y "ace" lcs_list all_lcs(X, Y) print(lcs_list)
Complexity
Time Complexity: O(m * n) for filling the DP table, where m and n are the lengths of the two strings. The backtracking step could take up to exponential time in the worst case, depending on the number of LCSs.
Space Complexity: O(m * n) due to the DP table used for storing intermediate results.
This approach provides a systematic and efficient way to find all the longest common subsequences of two strings, making it particularly useful in various text analysis and bioinformatics applications.