Longest Common Subsequence

From Algorithmist
Jump to: navigation, search

Longest Common Subsequence is the problem of finding the longest common subsequence of two sequences of items. This is used in the "diff" file comparison utility.

The solution utilizes dynamic programming.

Contents

[edit] Overview

The problem is usually defined as:

Given two sequence of items, find the longest subsequence present in both of them. A subsequence is a sequence that appears in the same relative order, but not necessarily contiguous. For example, in the string abcdefg, "abc", "abg", "bdf", "aeg" are all subsequences.

A naive exponential algorithm is to notice that a string of length n has O(2n) different subsequences, so we can take the shorter string, and test each of its subsequences for presence in the other string, greedily.

[edit] Recursive solution

We can try to solve the problem in terms of smaller subproblems. We are given two strings x and y, of length n and m respectively. We solve the problem of finding the longest common subsequence of x = x1...n and y = y1...m by taking the best of the three possible cases:

  1. The longest common subsequence of the strings x1...n - 1 and y1...m
  2. The longest common subsequence of the strings x1...n and y1...m - 1
  3. If xn is the same as ym, the longest common subsequence of the strings x1...n - 1 and y1...m - 1, followed by the common last character.

The base case: when one of the sequences is empty, their only common subsequence is the empty sequence of length 0.

It is easy to construct a recursive solution from this (in Python):

def lcs_len(x, y):
    """This function returns length of longest common sequence of x and y."""
 
    if len(x) == 0 or len(y) == 0:
        return 0
 
    xx = x[:-1]   # xx = sequence x without its last element    
    yy = y[:-1]
 
    if x[-1] == y[-1]:  # if last elements of x and y are equal
        return lcs_len(xx, yy) + 1
    else:
        return max(lcs_len(xx, y), lcs_len(x, yy))

and this is in C++ : --Mohamed Essam Arafa 15:57, 5 December 2012 (EST)

int lcs_len(int x,int y){//due to c++ limitations I used (int x,int y ) to express the end of the sequence , st and nd are two strings
	if(x==-1||y==-1){//Invalid pointers
		return 0;
	}
	int xx=x-1;
	int yy=y-1;
	if(st[x]==nd[y]){ // if the last element of both is equal
		return lcs_len(xx,yy)+1;
	}
	else{
		return max(lcs_len(xx,y),lcs_len(x,yy));
	}
}

and this is in C :

int lcs_len(char *s, char *t, int s_len, int t_len){//call it like this lcs_len(s, t, strlen(s)-1, strlen(t)-1) 
	if(s_len < 0 || t_len < 0){
		return 0;
	}
	if(s[s_len] == t[t_len])
	        return lcs_len(s, t, s_len-1, t_len-1)+1;
	return max(lcs_len(s, t, s_len-1, t_len),lcs_len(s, t, s_len, t_len-1));
}

[edit] Dynamic programming

Obviously, this is still not very efficient. But because the subproblems are repeated, we can use memoization. An even more (slightly) efficient way, which avoids the overhead of function calls, is to order the computation in such a way that whenever the results of subproblems are needed, they have already been computed, and can simply be looked up in a table. This is called Dynamic Programming.

In this case, we find lcs(x1..i,y1..j) for every i and j, starting from smaller ones, storing the results in an array at index (i,j) as we go along.

def lcs(x, y):
    n = len(x)
    m = len(y)
    table = dict()  # a hashtable, but we'll use it as a 2D array here
 
    for i in range(n+1):     # i=0,1,...,n
        for j in range(m+1):  # j=0,1,...,m
            if i == 0 or j == 0:
                table[i, j] = 0
            elif x[i-1] == y[j-1]:
                table[i, j] = table[i-1, j-1] + 1
            else:
                table[i, j] = max(table[i-1, j], table[i, j-1])
 
    # Now, table[n, m] is the length of LCS of x and y.
 
    # Let's go one step further and reconstruct
    # the actual sequence from DP table:
 
    def recon(i, j):
        if i == 0 or j == 0:
            return []
        elif x[i-1] == y[j-1]:
            return recon(i-1, j-1) + [x[i-1]]
        elif table[i-1, j] > table[i, j-1]: #index out of bounds bug here: what if the first elements in the sequences aren't equal
            return recon(i-1, j)
        else:
            return recon(i, j-1)
 
    return recon(n, m)

Notice how closely it parallels the recursive solution above, while entirely eliminating recursive calls. This "small" change makes the difference between exponential time and polynomial time.

[edit] Further reading

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox