星期四, 十月 12, 2006

基本算法连载(10)-模式匹配之BM(Boyer-Moore)

周末两天被BM算法折磨的要死。《a fast string search algorithm》论文中提到的算法思想倒是理解的差不多,但网上(http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140)给出的实现可就是看不懂。通过Baidu,Google一搜,可以看到很多Boyer-Moore的实现。但绝大部分实现都是简化版本,只考虑了bad-character shift,而忽略了good-suffix shift。
它思想的源泉是从右向左匹配字符串将获得更多有用的信息。The algorithm precomputes two tables to process the information it obtains in each failed verification: one table calculates how many positions ahead to start the next search based on the identity of the character that caused the match attempt to fail;the other makes a similar calculation based on how many characters were matched successfully before the match attempt failed.其中前一个称作bad-character shift,后一个称作good-suffix shift。想详细了解Boyer-Moore的思想,我极力推荐看作者发表的论文,它比网上对这个算法介绍的文章容易理解的多。
处理主串为n,模式串为m的情况,此算法的最好表现是:n/m。在这种情况下,只有模式串中的最后一个字符需要比较。不相等,马上可以跳过m个字符。从这里可以看出此算法一个违反直觉的性质:模式串越长,搜索越快。
最优的代码实现(没看懂,明天贴个我自己改的):
void preBmBc(char *x, int m, int bmBc[]) {
int i;

for (i = 0; i < ASIZE; ++i)
bmBc[i] = m;
for (i = 0; i < m - 1; ++i)
bmBc[x[i]] = m - i - 1;
}


void suffixes(char *x, int m, int *suff) {
int f, g, i;

suff[m - 1] = m;
g = m - 1;
for (i = m - 2; i >= 0; --i) {
if (i > g && suff[i + m - 1 - f] < i - g)
suff[i] = suff[i + m - 1 - f];
else {
if (i < g)
g = i;
f = i;
while (g >= 0 && x[g] == x[g + m - 1 - f])
--g;
suff[i] = f - g;
}
}
}

void preBmGs(char *x, int m, int bmGs[]) {
int i, j, suff[XSIZE];

suffixes(x, m, suff);

for (i = 0; i < m; ++i)
bmGs[i] = m;
j = 0;
for (i = m - 1; i >= -1; --i)
if (i == -1 || suff[i] == i + 1)
for (; j < m - 1 - i; ++j)
if (bmGs[j] == m)
bmGs[j] = m - 1 - i;
for (i = 0; i <= m - 2; ++i)
bmGs[m - 1 - suff[i]] = m - 1 - i;
}


void BM(char *x, int m, char *y, int n) {
int i, j, bmGs[XSIZE], bmBc[ASIZE];

/* Preprocessing */
preBmGs(x, m, bmGs);
preBmBc(x, m, bmBc);

/* Searching */
j = 0;
while (j <= n - m) {
for (i = m - 1; i >= 0 && x[i] == y[i + j]; --i);
if (i < 0) {
OUTPUT(j);
j += bmGs[0];
}
else
j += MAX(bmGs[i], bmBc[y[i + j]] - m + 1 + i);
}
}


没有评论: