Algorithms Chapter 20 Algorithms Sequential Parallel And Distributed String Matching And Document Processing Glance

subject Type Homework Help
subject Pages 9
subject Words 1754
subject Authors Jerome L. Paul, Kenneth A. Berman

Unlock document.

This document is partially blurred.
Unlock all pages and 1 million more documents.
Get Access
Algorithms: Sequential, Parallel and Distributed 1-1
Chapter 20
String Matching and Document Processing
At a Glance
Table of Contents
Overview
Objectives
Instructor Notes
Solutions to Selected Exercises
page-pf2
Algorithms: Sequential, Parallel and Distributed 1-2
Lecture Notes
Overview
In Chapter 20 we discuss string matching algorithms including some classical algorithms
based on preprocessing the pattern string, an approximation string matching algorithm
and string matching algorithms based on preprocessing the text. We begin the chapter by
discussing the string matching problem and the naïve algorithm for solving the problem.
In the next three chapters we discuss three standard string-matching algorithms, the
Knuth, Morris, and Pratt (KMP) algorithm; the Boyer and Moore (BM) algorithm; and the
Karp and Rabin (KR) algorithm. The KMP and BM string matching algorithms preprocess
the pattern string so that information gained during a search for a match of the pattern
facilitate very fast searching of a pattern string.
Chapter Objectives
After completing the chapter the student will know:
The KMP string-matching algorithm
The BM string-matching algorithm
The KR string-matching algorithm
The concept of the edit distance between two strings
A dynamic programming algorithm for computing the edit distance
The concepts of trie, compressed trie (PATRICIA tree), and suffix tree
Algorithms for searching tries, compressed tries and suffix trees for a given pattern
string
page-pf3
Algorithms: Sequential, Parallel and Distributed 1-3
Instructor Notes
The Karp-Rabin string matching algorithm can actually be considered as a Las Vegas
probabilistic algorithm (see Chapter 24). However, since the only probabilistic component of
the algorithm is to randomly choose the modulus q, it is appropriate to consider the algorithm
in this chapter before the notion of probabilistic algorithms is introduced.
NOTE: Ken needs to exand this. In section 20.5 we discuss an algorithm for determining
whether a pattern string P[0:m 1]
is a k-approximation of a text string T[0:n 1].
Solutions to Selected Exercises
Section 20.2 The Knuth-Morris-Pratt Algorithm
20.1 For this exercise we use the following version of KMPStringMatcher (the version given
in the first two printings of the text contains some redundant text comparisons). We show
that the strings P = 0m-11 and T = 0n that exhibit worst case (strongly asymptotic) mn
performance for the naïve string matcher also exhibit worst case 2n performance for
KMPStringMatcher. However, the improvement of KMPStringMatcher over the naïve
CreateNext(P[0:m 1], Next[0:m 1])
while i < n do
if P[j] = T[i] then
if j = m 1 then //match found at position i m + 1
return(i m + 1)
page-pf4
could begin at position i Next[j] in T
endif
endif
endwhile
return( 1)
20.3 For the string P = “cincinnati”, we have Next[0:9] = [0,0,0,0,1,2,3,0,0,0]. In part (a) of
the following figure, we show a trace of the values of i and j for each iteration of the while loop
in KMPStringMatcher. Positions marked with ! are where in the first statement of the while
(a)
c
i
n
c
i
a
t
t
i
_
i
s
c
i
n
c
i
n
n
a
t
i
_
m
i
s
s
e
c
i
n
c
i
n
a
t
i
c
i
c
i
n
n
a
t
i
c
i
n
c
i
n
n
t
i
c
i
n
c
i
n
a
t
i
c
i
n
c
i
n
a
t
i
c
i
n
c
n
n
a
t
i
c
i
n
i
n
n
a
t
i
c
i
c
i
n
n
a
t
i
c
n
c
i
n
n
a
t
i
i
n
c
i
n
n
a
t
i
c
i
n
c
i
n
n
a
t
i
m
a
c
(b)
page-pf5
20.4 We prove CreateNext is correct by induction on the loop variable i. (We remark that in
the first two printings of the text, the assignment j Next[j 1] made within the while loop
should be replaced by j Next[j].)
Basis steps: i = 0,1. Trivial
Induction step: Assume that CreateNext correctly computes Next[i 1], where i > 1. Now
consider the computation of the while loop for i. First notice that for the beginning of first
iteration of the while loop with the value i, the variable j has the value Next[i 1]. To see this,
20.5 Following similar reasoning to that used to establish the Key Fact on p. 637 in the text,
20.7
function SimplifiedBMStringMatcher(P[0:m 1],T[0:n 1], A)
Input: P[0:m 1] (a pattern string of length m)
T[0:n 1] (a text string of length n)
A (finite alphabet from which characters for P and T are drawn)
page-pf6
20.8 Consider the pattern string P[0: m 1] = 10m-1, and text string T[0: n 1] = 0n. Note that
the shift value corresponding to the character 0 has the value of Shift[0] = 1. Since this is the
20.10 a. For the pattern P = “01212121” all entries of the array Shift[0: |A| 1] not indexed
by 0,1,2 have value 8. Thus,
s1 = 8 k, k = 1,2, …, 7, c
{0,1,2}.
For c
{0,1,2}, we have
20.11 For the pattern P = “amalgam” all entries of the array Shift[0: |A| 1] not indexed
by a, g, l, m, have value 7. Thus,
s1 = 7 k, k = 1,2, …, 6, c
{a, l, g, m}.
page-pf7
Algorithms: Sequential, Parallel and Distributed 1-7
s1 = 5 k , c = m.
The good suffix shifts for P are 7,5,5,5,5,5, for k = 1,2, … ,6, respectively.
!
!
!
!
!
A
d
a
_
g
a
m
e
l
y
_
a
m
a
s
s
e
s
_
a
m
a
l
g
a
m
i
n
a
m
a
l
g
a
m
a
m
a
l
g
a
m
a
m
a
l
g
a
m
a
m
a
l
g
a
m
a
m
a
l
g
a
m
a
m
a
l
g
a
m
Section 20.4 The Karp-Rabin String-Matching Algorithm
20.15 a.
s
s
s
0 0 0 2 2
b.
s
T
)11(
s
T
s
0 0 0 2 2
1 0 2 4 2
page-pf8
Algorithms: Sequential, Parallel and Distributed 1-8
20.16 Assuming that each entry in the m × m patterns P[0: m 1, 0: m 1] and T[s: s + m
20.17 We have
]2[]3[][]1[ 21
1++++++=
msTkmsTksTksTT mm
s
,
20.19 a)
procedure EditDistance(T[0:n 1],P[0:m 1],D[0:n,0:m])
Input: T[0:n 1],P[0:m 1] (strings)
Output: D[0:n,0:m] (array such that D[i,j] is the edit distance between of T[0:i 1] and
P[0:j 1] )
for i 0 to n do //initialize for boundary conditions
D[i,0] i
b) Using text comparisons as the basic operation, it is clear that EditDistance has Θ(nm)
complexity.
20.21
p
a
t
r
i
c
i
a
D
0
1
2
3
4
5
6
7
8
0
0
1
2
3
4
5
6
7
8
P
1
1
0
1
2
3
4
5
6
7
page-pf9
Algorithms: Sequential, Parallel and Distributed 1-9
A
2
2
1
0
1
2
3
4
5
6
T
3
3
2
1
0
1
2
3
4
5
R
4
4
3
2
1
0
1
2
3
4
I
5
5
4
3
2
1
0
1
2
3
A
6
6
5
4
3
2
1
2
3
4
R
7
7
6
5
4
3
2
2
3
4
C
8
8
7
6
5
4
3
2
3
4
H
9
9
8
7
6
5
4
3
3
4
Section 20.5 Tries and Suffix Trees
20.22 Given a collection of strings C, where s is the total length over all strings in C, note that
the maximum number of nodes N(T) in the standard tree T for C occurs when each string has a
20.23 The algorithm proceeds by successively processing each string in C. When a new string
x is added to the tree, the algorithm checks whether x has a prefix that agrees with a prefix of a
string already stored in a child of the root. If not, a node containing x is added to the children
20.24 The algorithm checks to see if there is a child of the root whose string has a prefix that
agrees with a prefix of the given search string. If not, the string is not in the collection C. If so,
20.25 If ST is the compressed suffix (trie) tree for a text string T, and P[0:m-1] is a pattern
string, then we follow a path in ST in the manner described in the previous exercise. However,
|

Trusted by Thousands of
Students

Here are what students say about us.

Copyright ©2022 All rights reserved. | CoursePaper is not sponsored or endorsed by any college or university.