Understanding The Boyer-Moore Horspood algorithm using C# code

The Boyer-Moore Horspood algorithm is consider the most efficient string-matching algorithm. It can be used in text editors search, commands substitutions, highlight matching sub string.

It works the fastest when the search pattern is relatively long.

What is Boyer-Moore-Horspool Algorithm ?

Boyer-Moore-Horspool algorithm is for finding substrings into strings. This algorithm compares each characters of substring to find a word or the same characters into the string. When characters do not match, the search jumps to the next matching position in the pattern by the value indicated in the Bad Match Table.

What is Bad Match Table?

The Bad Match Table indicates how many jumps should it move from the current position to the next. This table contains the length to shift our substring by number of steps when a bad match occurs.

The whole process of this algorithm has been divided into two stages:

  • Generate Bad match table
  • Search Process

Stage 1 – Generate Bad match table

As explained above, Bad match table is a two dimensional array, where first row contains each character in substring (search pattern), and second row contains integer values. The value for each character is calculated using this formula:

 Value = length of substring – index of each letter in the substring – 1

For e.g. if our substring is “abcdbb” then the batch match table should be:

substring characterabcd*
value (first value calculated as 6-0-1) 51
3
2
6
  • Note that the value of the last letter and other letters that are not in the sub-string will be the length of the sub-string . In our case above, it is set to 6.
  • Also note that if there is any duplicate character in our search string, the previous value for that character is replaced with new value. That is why you can see that in our case the value of character b has been changed from 4 to 1.

How Bad match table helps?

Because of Bad match table, this technique gives a good search performance because it avoid lots of needless comparisons by significantly shifting pattern relative to text.

Here is a quick C# class that is responsible to to generate a bad match table for any search pattern (substring):

using System;
using System.Collections.Generic;

namespace Learning.BoyerMooreHorspoolSearch
{
    public interface IBadMatchTable
    {
        Dictionary<int, int> Table { get; }
        int NextJump(int character);
    }

    public class BadMatchTable : IBadMatchTable
    {
        private readonly Lazy<Dictionary<int, int>> _table;
        private readonly string _pattern;

        public BadMatchTable(string pattern)
        {
            _table = new Lazy<Dictionary<int, int>>(() => GenerateTable(pattern));
            _pattern = pattern;
        }

        public Dictionary<int, int> Table => _table.Value;

        public int NextJump(int character)
        {
            try
            {
                return _table.Value[character];
            }
            catch
            {
                // return default value when there is nothing in the bad match table.
                return _pattern.Length;
            }
        }

        private Dictionary<int, int> GenerateTable(string pattern)
        {
            var table = new Dictionary<int, int>(pattern.Length);

            // Last character distance value has to be equal to pattern length, so we will just ignore that for now.
            for (int idx = 0; idx < pattern.Length - 1; idx++)
            {
                table[pattern[idx]] = pattern.Length - idx - 1;
            }

            return table;
        }
    }
}

You can try running this code by using these test cases:

using Learning.BoyerMooreHorspoolSearch;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace LearningTests
{
    [TestClass]
    public class BadMatchTableTests
    {
  
        [TestMethod]
        public void BadMatchTable_Duplicate_Character_MustReturnsValidResponse()
        {
            var sut = new BadMatchTable("happily");
            Assert.AreEqual(5, sut.Table.Count);
            var expectedValues = new int[] { 6, 5, 3, 2, 1 };
            int i = 0;
            foreach (var item in sut.Table)
            {
                Assert.AreEqual(expectedValues[i++], item.Value);
            }
        }
        [TestMethod]
        public void BadMatchTable_NextJumpForDuplicateMustMatch()
        {
            var sut = new BadMatchTable("happily");
            Assert.AreEqual(3, sut.NextJump('p'));
        }
        [TestMethod]
        public void BadMatchTable_NextJumpForNonMatchedCharacter_Should_return_substringlength()
        {
            var sut = new BadMatchTable("happily");
            Assert.AreEqual(7, sut.NextJump(' '));
        }
    }
}

Now we have generated a “bad match table” for our search phrase “happily“, we will now use this table to search “happily” in a text:

mobile citi was happy to oblige to another request happily.”

substring characterha
p
i
l
*
value (first value calculated as 6-0-1) 653
2
1
7

Stage 2 – Search Process

In this part of the process, the substring is compared from the last character. If the character is not matched then the bad match table is used to skip characters. We search for that character from text, inside of the bad match table. If the character is not found then we use default value. In our case it will be 7.

Let us execute couple of steps to understand the whole process of searching text:

By now you must have got an idea about how the whole process of searching works. If not, don’t worry, I have got you a nice piece of C# code that you can copy and and paste in your project and then run couple of test cases to see the magic of this search:

namespace Learning.BoyerMooreHorspoolSearch
{
    public class StringSearchMatch
    {
        public int StartIndex { get; set; }
        public int Length { get; set; }
    }

 public class BoyerMooreHorspool
    {
        private IBadMatchTable _badMatchTable;

        public BoyerMooreHorspool(IBadMatchTable badMatchTable)
        {
            _badMatchTable = badMatchTable;
        }

        public IEnumerable<StringSearchMatch> Search(string text, string pattern)
        {
            int currentStartIndex = 0;
            while (currentStartIndex <= text.Length - pattern.Length)
            {
                int charactersLeftToMatch = pattern.Length - 1;
                while (charactersLeftToMatch >= 0 && string.Equals(pattern[charactersLeftToMatch], text[currentStartIndex + charactersLeftToMatch]))
                {
                    charactersLeftToMatch--;
                }
                if (charactersLeftToMatch < 0)
                {
                    yield return new StringSearchMatch { StartIndex = currentStartIndex, Length = pattern.Length };
                    currentStartIndex += pattern.Length;
                }
                else
                {
                        
                    currentStartIndex += _badMatchTable.NextJump(text[currentStartIndex + pattern.Length - 1]) ;
                }
            }
        }
    }
}

You can try running this code by using these test cases:

using Learning.BoyerMooreHorspoolSearch;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Linq;

namespace LearningTests
{
    [TestClass]
    public class BoyerMooreHorspoolSearchTests
    {

        [TestMethod]
        public void BoyerMooreHorspool_MustReturn_SingleCount()
        {
            var text = "Mobile citi was happy to oblige to another request happily.".ToLower();
            var sut = new BoyerMooreHorspool(new BadMatchTable("happily"));
            var searchResult = sut.Search(text, "happily");
            Assert.AreEqual(1, searchResult.Count());
            Assert.AreEqual(51, searchResult.First().StartIndex);
        }
    }
}

In summary, this whole search process is based on these three steps:

  • If the substring letter matches, then compare with the preceding (backward) letter .
  • If it doesn’t match, check mismatched character (from text) value in the Bad Match Table. In case there is no character value found, use the default value that is search pattern length.
  • Then, skip the number of spaces that the table value indicates, and repeat the whole process until you reached till the end.

Analysis:

  • Boyer-Moore algorithm is extremely fast on large alphabet (relative to the length of the pattern).
  • Preprocessing: Uses only Bad-character shift. Efficient bad matah table for small alphabets.
  • Best case Θ(n/m)
  • Worst case Θ(nm)