Edit Distance Problem
Welcome to Day 36 of our 60 Days of Coding Algorithm Challenge! Today, we’ll explore the Edit Distance Problem, a classic dynamic programming problem with applications in natural language processing, bioinformatics, and more.
What is the Edit Distance Problem?
The Edit Distance Problem, also known as the Levenshtein Distance Problem, is about finding the minimum number of operations required to transform one string into another. The allowed operations are:
- Insert a character
- Delete a character
- Replace a character
For example, the edit distance between “kitten” and “sitting” is 3, as we need to:
- Replace ‘k’ with ’s’
- Replace ’e’ with ‘i’
- Insert ‘g’ at the end
Naive Recursive Approach
Let’s start with a naive recursive implementation:
1def edit_distance_naive(str1, str2, m, n):
2 # If first string is empty, insert all characters of second string
3 if m == 0:
4 return n
5
6 # If second string is empty, remove all characters of first string
7 if n == 0:
8 return m
9
10 # If last characters are same, …
Edit Distance Problem
Welcome to Day 36 of our 60 Days of Coding Algorithm Challenge! Today, we’ll explore the Edit Distance Problem, a classic dynamic programming problem with applications in natural language processing, bioinformatics, and more.
What is the Edit Distance Problem?
The Edit Distance Problem, also known as the Levenshtein Distance Problem, is about finding the minimum number of operations required to transform one string into another. The allowed operations are:
- Insert a character
- Delete a character
- Replace a character
For example, the edit distance between “kitten” and “sitting” is 3, as we need to:
- Replace ‘k’ with ’s’
- Replace ’e’ with ‘i’
- Insert ‘g’ at the end
Naive Recursive Approach
Let’s start with a naive recursive implementation:
1def edit_distance_naive(str1, str2, m, n):
2 # If first string is empty, insert all characters of second string
3 if m == 0:
4 return n
5
6 # If second string is empty, remove all characters of first string
7 if n == 0:
8 return m
9
10 # If last characters are same, ignore last char and recur for remaining string
11 if str1[m-1] == str2[n-1]:
12 return edit_distance_naive(str1, str2, m-1, n-1)
13
14 # If last characters are different, consider all possibilities and find minimum
15 return 1 + min(edit_distance_naive(str1, str2, m, n-1), # Insert
16 edit_distance_naive(str1, str2, m-1, n), # Remove
17 edit_distance_naive(str1, str2, m-1, n-1)) # Replace
18
19# Example usage
20str1 = "kitten"
21str2 = "sitting"
22print(f"Edit Distance between '{str1}' and '{str2}': {edit_distance_naive(str1, str2, len(str1), len(str2))}")
This approach has a time complexity of O(3^(m+n)), making it inefficient for longer strings.
Dynamic Programming Approach
We can significantly improve the efficiency using Dynamic Programming.
Top-Down Approach (Memoization)
1def edit_distance_memo(str1, str2):
2 m, n = len(str1), len(str2)
3 memo = {}
4
5 def dp(i, j):
6 if i == 0:
7 return j
8 if j == 0:
9 return i
10
11 if (i, j) in memo:
12 return memo[(i, j)]
13
14 if str1[i-1] == str2[j-1]:
15 memo[(i, j)] = dp(i-1, j-1)
16 else:
17 memo[(i, j)] = 1 + min(dp(i, j-1), # Insert
18 dp(i-1, j), # Remove
19 dp(i-1, j-1)) # Replace
20 return memo[(i, j)]
21
22 return dp(m, n)
23
24# Example usage
25str1 = "kitten"
26str2 = "sitting"
27print(f"Edit Distance between '{str1}' and '{str2}': {edit_distance_memo(str1, str2)}")
Bottom-Up Approach (Tabulation)
1def edit_distance_tab(str1, str2):
2 m, n = len(str1), len(str2)
3 dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
4
5 # Fill d[][] in bottom up manner
6 for i in range(m + 1):
7 for j in range(n + 1):
8 # If first string is empty, only option is to insert all characters of second string
9 if i == 0:
10 dp[i][j] = j
11
12 # If second string is empty, only option is to remove all characters of first string
13 elif j == 0:
14 dp[i][j] = i
15
16 # If last characters are same, ignore last char and recur for remaining string
17 elif str1[i-1] == str2[j-1]:
18 dp[i][j] = dp[i-1][j-1]
19
20 # If last characters are different, consider all possibilities and find minimum
21 else:
22 dp[i][j] = 1 + min(dp[i][j-1], # Insert
23 dp[i-1][j], # Remove
24 dp[i-1][j-1]) # Replace
25
26 return dp[m][n]
27
28# Example usage
29str1 = "kitten"
30str2 = "sitting"
31print(f"Edit Distance between '{str1}' and '{str2}': {edit_distance_tab(str1, str2)}")
Both Dynamic Programming approaches have a time complexity of O(mn) and a space complexity of O(mn), where m and n are the lengths of the input strings.
Printing the Edits
We can modify our solution to also print the actual edits needed:
1def edit_distance_with_path(str1, str2):
2 m, n = len(str1), len(str2)
3 dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
4
5 # Fill dp[][] in bottom up manner
6 for i in range(m + 1):
7 for j in range(n + 1):
8 if i == 0:
9 dp[i][j] = j
10 elif j == 0:
11 dp[i][j] = i
12 elif str1[i-1] == str2[j-1]:
13 dp[i][j] = dp[i-1][j-1]
14 else:
15 dp[i][j] = 1 + min(dp[i][j-1], dp[i-1][j], dp[i-1][j-1])
16
17 # Backtrack to find the edits
18 edits = []
19 i, j = m, n
20 while i > 0 and j > 0:
21 if str1[i-1] == str2[j-1]:
22 i -= 1
23 j -= 1
24 elif dp[i][j] == dp[i-1][j-1] + 1:
25 edits.append(f"Replace '{str1[i-1]}' with '{str2[j-1]}'")
26 i -= 1
27 j -= 1
28 elif dp[i][j] == dp[i-1][j] + 1:
29 edits.append(f"Delete '{str1[i-1]}'")
30 i -= 1
31 elif dp[i][j] == dp[i][j-1] + 1:
32 edits.append(f"Insert '{str2[j-1]}'")
33 j -= 1
34
35 while i > 0:
36 edits.append(f"Delete '{str1[i-1]}'")
37 i -= 1
38 while j > 0:
39 edits.append(f"Insert '{str2[j-1]}'")
40 j -= 1
41
42 return dp[m][n], reversed(edits)
43
44# Example usage
45str1 = "kitten"
46str2 = "sitting"
47distance, edits = edit_distance_with_path(str1, str2)
48print(f"Edit Distance between '{str1}' and '{str2}': {distance}")
49print("Edits:")
50for edit in edits:
51 print(edit)
Applications of Edit Distance
- Spell checking and correction
- DNA sequence alignment in bioinformatics
- Plagiarism detection
- Machine translation and natural language processing
- Fuzzy string searching
Exercise
Implement a version of the Edit Distance algorithm where the costs of insertion, deletion, and replacement are different.
Modify the algorithm to find the longest common subsequence of two strings.
Implement a function that uses Edit Distance to find similar words in a dictionary (like a simple spell checker).
Summary
Today, we explored the Edit Distance Problem, a classic example of dynamic programming. We implemented various approaches, from naive recursion to optimized dynamic programming solutions, and even a version that provides the actual sequence of edits.
Understanding the Edit Distance problem and its solutions provides valuable insights into dynamic programming techniques and their applications in string manipulation and comparison problems.
Tomorrow, we’ll dive into another fundamental algorithm: the Bellman-Ford algorithm for finding shortest paths in a graph with negative edge weights. Stay tuned!