OXFORD UNIVERSITY COMPUTING LABORATORY

Minutes of 2005 Meetings of the Algebra of Programming Research Group

Michaelmas Term 2005 (all Fridays)
Trinity Term 2005 (all Fridays)

Friday, 2nd December 2005
Present: JG, GJ, BJS, RSB, BCDSO, CEM, IMB (minutes by JG and CEM)

We discussed matrix transposition, which underlies RSB's question from a week or two ago about how to show that his rows, cols and boxs functions for manipulating sudoku grids are their own inverses. If you represent a matrix as a list of lists, the proof that transposition is its own inverse is a pig (not least because it isn't true, on account of ragged lists).

However, if you express matrices as fixed-size vectors of vectors, the result is certainly true; it even turns out to be easy to prove. JG showed it for triples.
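
For the record, a minimal version of the triples case (the Triple synonym and transpose3 are illustrative names, not necessarily what JG wrote):

     type Triple a = (a, a, a)

     -- transpose3 . transpose3 = id follows by exhausting the nine components
     transpose3 :: Triple (Triple a) -> Triple (Triple a)
     transpose3 ((a, b, c), (d, e, f), (g, h, i)) =
       ((a, d, g), (b, e, h), (c, f, i))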

We then set about trying to generalize to vectors of arbitrary size, using the GADT techniques discussed recently. We nearly managed the construction, and JG completed it later. However, BCDSO convinced him that the approach was wrong, and showed a much simpler approach inspired by Conor McBride's idioms.
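
As a rough sketch only (not necessarily the construction BCDSO showed), size-indexed vectors carry an idiom-style structure - a "pure" that replicates and a pointwise application - and transposition is just a distribution of one vector layer through the other, for which being self-inverse can be proved once and for all:

     {-# LANGUAGE GADTs, DataKinds, KindSignatures #-}

     data Nat = Z | S Nat

     data Vec (n :: Nat) a where
       VNil  :: Vec 'Z a
       VCons :: a -> Vec n a -> Vec ('S n) a

     instance Functor (Vec n) where
       fmap _ VNil         = VNil
       fmap f (VCons x xs) = VCons (f x) (fmap f xs)

     -- replication needs to know the length, hence a class of known lengths
     class KnownVec n where vpure :: a -> Vec n a
     instance KnownVec 'Z where vpure _ = VNil
     instance KnownVec n => KnownVec ('S n) where vpure x = VCons x (vpure x)

     -- pointwise application, the other half of the idiom structure
     vap :: Vec n (a -> b) -> Vec n a -> Vec n b
     vap VNil         VNil         = VNil
     vap (VCons f fs) (VCons x xs) = VCons (f x) (vap fs xs)

     -- transposition distributes the outer vector through the inner one
     transposeV :: KnownVec m => Vec n (Vec m a) -> Vec m (Vec n a)
     transposeV VNil           = vpure VNil
     transposeV (VCons xs xss) = vap (fmap VCons xs) (transposeV xss)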

Friday, 4th November 2005
Present: RSB, GJ, SAC, CEM, BS (minutes by SAC)

SAC presented some more information about the hats problem.

[Editor's note: We now have a reference for this problem [1].]

SAC wrote up a number on the whiteboard:

     3^{n * 2^{n-1}}
This is the number of deterministic strategies the group can have. This comes from a group strategy being made up of individual strategies. There are
     3^{2^{n-1}}
possibilities for an individual strategy, as each individual must decide what to do for each of 2^{n-1} possible hat combinations of the other players.

SAC has written a program to check all possible strategies (she had a dynamic programming algorithm in mind for larger numbers, but wanted to get some exact optimal numbers for the base cases).

The trouble is, that's a rather large number of strategies. For 4 people, it's 3^32 - too many to check on a computer using a simple naive generate-and-test program.

You can halve the number of strategies to check by noting a red/black symmetry (any strategy is essentially the same if red is replaced by black and vice versa), and shuffling the order of the hat-wearers also gives essentially the same strategy, but neither observation reduces the number of strategies to check by much.

SAC said that interestingly enough, whilst the standard 75% chance strategy for 3 hat-wearers is optimal, it isn't the only optimal such strategy! There is another strategy - well 3, but the three are essentially the same if you change the roles of the players.

Inspired by this news, RSB, GJ, CEM and BS immediately set to figuring out this other 3-person strategy, which they did.

   Person 0          Person 1          Person 2

   _rr -> r          r_r -> p          rr_ -> p
   _rb -> p          r_b -> r          rb_ -> r
   _br -> p          b_r -> b          br_ -> b
   _bb -> b          b_b -> p          bb_ -> p
Described in words, this strategy is as follows:
  • Person 0: if you see two hats the same, write down that colour
  • Persons 1,2: if you see two hats the same, pass, otherwise write down 0's hat colour
Checking the survival chances are 75%:
   Hats    Responses

   rrr     rpp          survive
   rrb     prp          survive
   rbr     ppr          survive
   brr     rbb          die
   bbr     pbp          survive
   brb     ppb          survive
   rbb     brr          die
   bbb     bpp          survive
Thinking about larger groups of hat-wearers, from last week we know a good strategy for seven people: get the first four to write down the opposite colour if they see three of the same, otherwise pass. 1/8 of the time that will result in failure, 3/8 of the time that will result in everyone passing. The remaining three can see what the first four will do, so if the 4-group pass, then they use the 3-strategy to improve the group's chances (to 25/32).

In general, if there are 4n+3 people, you can get the first four to do the 4-strategy, then if they all pass, then the next four have a go, and so on. You can think of the 4-groups as "non-final" groups, and the last 3 as a "final group". The final group will try as much as possible to maximise the chance of success. The non-final groups will try very hard not to fail but don't necessarily have a large probability of success.
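
A small sketch of that calculation (the function names and the model - a group of size k succeeds outright with probability 2k/2^k, fails with 2/2^k, and otherwise everyone in it passes - are mine, not from the meeting):

     chanceOfSurvival :: [Int] -> Double
     chanceOfSurvival []       = 0
     chanceOfSurvival [k]      = win k                      -- the final group
     chanceOfSurvival (k : ks) = win k + passes * chanceOfSurvival ks
       where passes = 1 - win k - lose k

     win, lose :: Int -> Double
     win  k = 2 * fromIntegral k / 2 ^^ k   -- exactly one hat differs from the rest
     lose k = 2 / 2 ^^ k                    -- all k hats the same

     -- chanceOfSurvival [4,3] == 0.78125, the 25/32 for seven people above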

The problem with this strategy is that because the first 4-group have a 1/8 chance of failing outright, the remaining groups can't improve the chances beyond 7/8. This can be overcome by using a bigger group than 4 for the first non-final group. Then you can use dynamic programming to find the best way to partition into groups. Gavin Lowe has such a program, and found that (e.g.) for 420 hat-wearers, the chances were improved to approximately 0.9.

Of course this doesn't say whether there might be a better way to partition, e.g. a non-final group into non-final groups, or whether there might be some entirely different strategy that is better.

RSB wanted to see a specification of the problem; SAC wrote up the outline of her program:

    bestStrategies n = bestOf (successRate (allHats n)) (allStrategies n)

    allHats n = getAll n ['r','b']

    allStrategies n = getAll n (allIndividualPlans n)

    allIndividualPlans n = map (zip (allHats (n-1)))
                               (getAll (length (allHats (n-1))) ['r','b','p'])

    getAll n xs = cp (copy n xs)
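
The helpers cp and copy weren't written up; plausible definitions (an assumption, not necessarily SAC's) are the cartesian product of a list of lists and replication:

     cp :: [[a]] -> [[a]]
     cp []         = [[]]
     cp (xs : xss) = [ x : ys | x <- xs, ys <- cp xss ]

     copy :: Int -> a -> [a]
     copy = replicate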
GJ pointed out that this was inefficient in its use of space: only the response for each hat combination needs to be kept, so you don't need all this zip stuff. It was also suggested that rather than represent an individual strategy as a function from hat combinations of n-1 people, it should be represented as a function from hat combinations of all n people, ignoring the hat colour of the person whose strategy it is.

RSB mused that this sort of problem is reminiscent of puzzles like the one where several (logical) people have dirty faces, some have clean faces, and after several clock ticks all the people with dirty faces put up their hands.

RSB said that Alexandru Baltag does work on epistemic problems.

SAC mentioned a paper [2] presented at TFP 2005 which uses the example of the dining cryptographers: they are dining at a table and either one of them has paid for the meal, or the company they work for has paid. How do they find out whether the company has paid, without revealing the identity of the donor if the donor is amongst them?

SAC recalled a suitable protocol:

Each pair of neighbours flip a coin, visible only to those two. Each person says "true" if the two coins either side of them came up the same, and says "false" if the two coins came up different. BUT the donor, if amongst them, must do the opposite. A parity check reveals whether the donor was amongst them.
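
A hedged sketch of the protocol (my names, not SAC's program): coin i is shared by neighbours i and i+1 around the table, each diner announces whether their two adjacent coins agree, and the donor (if present) lies; the parity of the "disagree" announcements is odd exactly when a diner paid.

     announcements :: [Bool] -> Maybe Int -> [Bool]
     announcements coins donor =
       [ lieIf (Just i == donor) (left == right)
       | (i, (left, right)) <- zip [0 ..] (zip coins (tail coins ++ [head coins])) ]
       where lieIf b x = if b then not x else x

     donorAmongUs :: [Bool] -> Bool
     donorAmongUs = odd . length . filter not

     -- donorAmongUs (announcements [True, False, True] (Just 1)) == True
     -- donorAmongUs (announcements [True, False, True] Nothing)  == False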

---------------

There was also a short diversion into SuDoku: RSB wanted to know how many possible boards there are in SuDoku, and tried one possible method for counting but was shown why it didn't work.

"Isn't this an easy problem?"
"No" (in unison)
"Why are you so sure it's not easy?"
"Instinct"
"Everything should be presumed difficult until shown to be easy"

[Ed note - CEM added: I have just read the minutes from Friday and saw that you were looking at the number of valid Sudoku grids. I came across a paper on this a few weeks ago, which shows that there certainly are a lot.]
[1] Sara Robinson: Why Mathematicians Now Care About Their Hat Color. New York Times, April 10th 2001.

[2] Jan van Eijck and Simona Orzan: Modelling the epistemics of communication with functional programming. TFP 2005.

Friday, 28th October 2005
Present: RSB, GJ, SAC, JG, BS (minutes by SAC)

There was a short discussion about what speakers we might like to have for BCTCS 2007 (British Colloquium for Theoretical Computer Science).

SAC then talked about plans to write a book on greedy algorithms, and the difficulties of choosing how to present the different theories supporting greedy algorithms. There are various notations for the different greedy theories, including:

  • SAC's relational expressions like
               M  =  opt R_g . /\ (rep Step)       for specifications

               G  =  opt R_l . /\ Step             for a greedy step

               rep G  subsetof  M                  to express that the greedy
                                                   algorithm rep G solves M

  • RSB/Oege's catamorphisms/anamorphisms style
  • operations research style, which has a sequence of decisions being made and a "policy" for which decision to make to implement the greedy choice
  • Hellman's presentation of greedy algorithms
  • Korte/Lovasz [1]'s set-based presentation of matroids and greedoids
A matroid is a set of sets E, such that
  • {} is in E
  • for sets X,Y, if Y is in E and X subsetof Y, then X is in E too
  • (exchange property) for sets X,Y in E such that |X| = |Y| + 1, there must be some x in X - Y such that Y union {x} is in E too
Greedoids are the sequence version of matroids.
  • [] is in L
  • if x ++ y is in L then x is in L
  • for x,y in L such that len x > len y , there must be some a in x such that y ++ [a] is in L too
Greedy algorithms come in when you have an objective function w on the elements of the matroid/greedoid: the greedy algorithm repeatedly selects an element x to add to the current set/sequence S such that S union {x} (respectively S ++ [x]) is in E (respectively L) and w(S union {x}) (respectively w(S ++ [x])) is maximised (or minimised).
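
As a minimal functional sketch (using my own names inL, w and ground, and sequences rather than SAC's relational notation), the generic greedy algorithm over a greedoid language looks like this:

     import Data.List (maximumBy, (\\))
     import Data.Ord (comparing)

     -- inL tests membership of the greedoid language, w is the objective,
     -- ground is the ground set; each step takes the best feasible extension
     greedy :: (Eq a, Ord b) => ([a] -> Bool) -> ([a] -> b) -> [a] -> [a]
     greedy inL w ground = go []
       where
         go s = case [ s ++ [x] | x <- ground \\ s, inL (s ++ [x]) ] of
                  []   -> s
                  exts -> go (maximumBy (comparing w) exts)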

Kruskal's algorithm fits into the matroid set system, and Prim's algorithm fits into the greedoid system.

There followed much discussion about what should/shouldn't be in the book.

JG wanted to know the audience for the book. The audience would be varied, potentially spanning the interested undergraduate (who may not look at the heavy theory sections) to researchers. There is no book on greedy algorithms; SAC wants to produce *the* book on greedy algorithms, a reference book. RSB wanted "My Little Book of Greedy Algorithms", with the greedy theory in a separate reference book.

It was agreed that representing greedy algorithms using sequences was a good idea: whilst you can have greedy algorithms on other data structures, it has been shown that folds/unfolds can be represented as iterations anyway, and there aren't many non-sequential greedy algorithms.

SAC suggested maybe the concept of state and state transition might be a unifying theme?

RSB wanted to be presented with a problem that had a non-obvious greedy solution. We discussed the box stacking problem. GJ asked whether the boxes have thickness. You can effectively ignore the thickness of the cardboard to treat them like this:

Boxes have a width w and breadth b (we are not concerned about the height), and one box i will fit inside box j if

          (w_i < w_j  &  b_i < b_j)   \/   (w_i < b_j  &  b_i < w_j)
(no funny fittings with two boxes next to each other inside a bigger one)

You have a big pile of cardboard boxes, and you want to make stacks of them (using the stack-inside relation above). The problem is to stack all the boxes so that there are as few stacks as possible.

Apparently there is a big box stack in the Tate Modern and SAC should go there and take photos.

SAC also drew a parallel between this problem and an activity scheduling problem (activities have start and end times, you can't have two activities scheduled in the same room at the same time, and you want as few rooms as possible for the schedule).

These are both minimum chain problems. Given the ordering relation, you want to partition the elements into as few chains as possible (by Dilworth's theorem the minimum number of chains equals the maximum size of an antichain).

If you order the activities by start time (a topological sort respecting the ordering relation), the obvious greedy algorithm of adding each activity to the first available room (or starting a new one if necessary) works.
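
A sketch of that greedy algorithm for the activities (the Activity type and the (start, end) representation are assumptions for illustration):

     import Data.List (sortOn)

     type Activity = (Int, Int)   -- (start time, end time)

     -- sort by start time, then put each activity into the first room whose
     -- most recent activity has finished, opening a new room only if necessary
     allocate :: [Activity] -> [[Activity]]
     allocate = foldl intoFirstFree [] . sortOn fst
       where
         intoFirstFree [] a = [[a]]
         intoFirstFree (room@((_, end) : _) : rooms) a
           | fst a >= end = (a : room) : rooms
           | otherwise    = room : intoFirstFree rooms a
         intoFirstFree ([] : rooms) a = intoFirstFree rooms a  -- rooms never empty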

The same can be done for the boxes if the right topological sort is used: sorting the boxes by increasing length of the longer side - and, for boxes whose longer sides are equal, by *decreasing* length of the shorter side - works.

This reminded RSB for some reason of Higman's Lemma.

Suppose you have a preorder <=

A quasi-well-order (RSB thinks that's right) is one whereby for any infinite sequence

        x0, x1, x2, ....
there exists an infinite subchain
        x_{i1} <= x_{i2} <= x_{i3} <= .....
Now define <=* : [a] -> [a] -> Bool
   xs <=* ws   if there exists a subsequence ys of ws such that
                  |xs| = |ys|   and for all i   xi <=  yi
The lemma: if <= is a quasi-well-order, then <=* is also a quasi-well-order.

This lemma is central to CSP (useful for the traces model). The proof is only half a page, but it is impredicative.

JG wanted to know what impredicative was. RSB explained that it was where it included universal quantification and wasn't constructive. For example, proofs by induction construct what you're trying to prove and are not impredicative.

However, RSB mentioned that someone had managed to push it through a system which did give a constructive proof. The challenge would be to prove this lemma without 10 years of effort(!).

RSB was keen to tell us about the hats problem, before finishing. He told us this comes from Frances Kirwan's husband.

There are 4 people (or more). They each have a hat placed on their head, and enter a room. The hats are chosen to be either red or black, according to a uniform random distribution - the whole point of the hats is that each person in the room can see everyone else's hat but not their own.

Before they enter the room, they can discuss strategy. Once they are in the room they can't communicate, all they do is to (simultaneously, without looking at anyone else's, I presume?) each write down one of "red" "black" or "pass" on their own piece of paper.

If one (or more) person gets the colour of their own hat correct and no-one gets it wrong, then they all live. In all other circumstances (they all pass, someone gets it wrong) everyone in the room dies horribly.

What's their best chance of getting out of the room alive?

We agreed that for 1 person, he has to guess, so he has a 50% chance of living.

For 2 people, it looks like you can't do better than have one person guess and the other pass (no point in them both guessing). Again, 50%

For 3 people, you can do better than 50% if everyone uses this strategy: if you see that the other two have hats the same colour, write down the opposite colour; if the other two have different hats, pass. For example, in an RRB hat situation, PPB would be written down, which lets them live. Out of the eight possibilities, RRR and BBB result in death, but RRB, RBR, BRR, BBR, BRB and RBB let them live: a 75% survival chance.
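
A small sketch (not from the meeting) checking the 75% figure by enumerating all eight hat assignments:

     hats3 :: [String]
     hats3 = sequence (replicate 3 "RB")   -- all assignments of 'R'/'B' to 3 people

     respond :: Char -> Char -> Char       -- given the two hats you can see
     respond a b | a == b    = opposite a  -- both the same: guess the opposite
                 | otherwise = 'P'         -- otherwise pass
       where opposite 'R' = 'B'
             opposite _   = 'R'

     survives :: String -> Bool
     survives [h0, h1, h2] = and rights && or rights   -- nobody wrong, someone right
       where answers = [respond h1 h2, respond h0 h2, respond h0 h1]
             rights  = [ g == h | (g, h) <- zip answers [h0, h1, h2], g /= 'P' ]
     survives _ = False

     -- length (filter survives hats3) == 6, i.e. a 6/8 = 75% survival chance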

Gavin Lowe came up with a sneakier way to boost the chances, for example when there are 7 people in the room: first split up into a group of four and a group of three. The 4-group can use the same kind of strategy, each writing down the opposite colour if the other three hats in the group are all the same. Half the time there is a 3-1 split within the group, which results in success. 1/8th of the time it's RRRR or BBBB, they all get it wrong, and they die. The rest of the time (3/8), it's a 2red-2black split and they all pass. The 3-group can see the 4-group's hats, so they can deduce what will happen to the 4-group: if the 4-group are going to live, the 3-group all pass; if they are going to die, who cares. If the 4-group all passed, then the 3-group have a 75% chance of converting it to a success. This gives an overall success chance of 1/2 + 3/8 * 3/4 = 25/32, which is marginally better than 75%.

With thoughts of suicidal hat-wearers in our brains, we all trooped off.

[1] Korte, Lovasz, Schrader: Greedoids, Springer-Verlag, 1990

Friday, 14th October 2005
(Minutes by SC)

Present: RSB, GJ, SAC, CEM, BS

We reminded RSB that he'd promised to tell us about Sudoku. RSB is planning to give this to his functional programming students as an example later on in term. We reminded ourselves what Sudoku was (e.g. see [1]), and then got on with giving some types

     type Matrix a = [[a]]

     type Board = Matrix Char

     sudoku :: Board -> [Board]
That is, a sudoku solver program will take an input board, and will return a list of ways it can be filled in.

There then followed a discussion about how this 2D puzzle was really better visualised as a 4D puzzle.

Back to the program development, SAC wanted some concrete examples of parameters, e.g. to see how big the board size was.

     boardsize = 9
     boxsize = 3
     cellvals = "123456789"
You also have a predicate blank for testing whether a cell is blank or not.

Where should RSB start on this puzzle with his students? How about a "generate and test"?

     sudoku = filter legal . mcp . choices

     choices = map (map choice)

     choice e = if blank e then cellvals
                else [e]

     mcp :: Matrix [a] -> [Matrix a]
     mcp = cp . map cp
(cp is the cartesian product)

GJ commented that he wouldn't have done it this way: he'd have generated all possible grids, and then used the non-blank squares on the given board later on in the algorithm.

     legal b =
           all nodups (rows b)
           &&  all nodups (cols b)
           &&  all nodups (boxs b)

     rows = id
     cols = trans
     boxs = map cols . block . map block
            where block = groupBy 3
These help illustrate the multi-dimensional nature of the boards.

Mutterings were heard about wanting to spell the latter "boxes" but RSB wanted to stick to four-letter words. Is there a nicer definition of boxs?
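
The helpers assumed above weren't spelled out; plausible definitions (an assumption - note this groupBy chunks a list into pieces of length n, it is not Data.List.groupBy):

     import Data.List (transpose)

     trans :: [[a]] -> [[a]]
     trans = transpose

     groupBy :: Int -> [a] -> [[a]]
     groupBy _ [] = []
     groupBy n xs = take n xs : groupBy n (drop n xs)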

The next step was to try and narrow down some of the possible grids. Some pruning was needed:

     prune cm = zipWith (zipWith op) cm (used cm)

              where op cs us = if single cs
                                 then cs
                                 else cs -- us

     used cm = zipWith3 join (rvals cm) (cvals cm) (bvals cm)
(ok the -- is a comment line in Haskell but we know it means list difference)

"I'm deeply bored with this now" "a nine by nine board?"

So at the moment we have a program that's inefficient. Are the students going to understand this? What we'd like to do is show them how to treat the data structure as a whole, not just cell-by-cell.

"Wholemeal programming", said GJ.

We've now got

           filter legal . mcp . prune . choices
Now what?

SAC suggested one strategy: for example in the row

     [3] [ ] [2] [ ] [1] [4] [5] [  ] [8]
if the second cell can either be 6 or 7, the fourth cell can either be 6 or 7, and the eighth cell can either be 7 or 9, then 9 has to go in the eighth cell since it can't go in the second or fourth.

Apparently this is known as a "level 2 strategy".
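
A hedged sketch of that idea for a single row of choice lists (level2Row is my name, not code from the meeting): if a value appears in exactly one cell's list of choices, that cell must take it.

     level2Row :: Eq a => [a] -> [[a]] -> [[a]]
     level2Row vals row = foldr pin row vals
       where
         pin v cells =
           case [ i | (i, cs) <- zip [0 ..] cells, v `elem` cs ] of
             [i] -> [ if j == i then [v] else cs | (j, cs) <- zip [0 ..] cells ]
             _   -> cells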

"It'll all be over within a year. Avian flu will kill all the Sudoku sites."

It was suggested that repeated prunings would be a good idea: "prunes go with wholemeal programming"

After that? Depth-first search and backtracking. Wouldn't it be a good idea to prune after each step? Once certain cells are allocated, there may be more pruning that you can do, which may fix more values in cells: "pruning often enough might unblock you"

"Of course, I could always iterate the prunes.... you can always do with more prunes", RSB pondered.

---------------

Discussion widened.

Comparisons with the Countdown algorithms [2] were made. Using memoization, such algorithms took little space, but a lot of garbage collection.

Also, how to generate Sudoku boards? One way is to start with a full board, and then remove values. RSB wondered whether full legal boards had the same structure? "No" said GJ and SAC immediately.

RSB wondered whether lists were intrinsic to modelling Sudoku? Although the board layout suggests lists, this isn't mandatory, really the constraints deal with sets of numbers. Maybe sets? Bags, suggested SAC. Although there are no duplicates of numbers in the lists/rows/columns, you could have duplicate constraints for a cell, e.g. in the example above two cells have the constraint of being 6 or 7. So bags might come into it somewhere.

GJ returned to the use of zipWiths. If you have an expression like

      zipWith3 op (A union B union C)      with op being set difference
then you could express it like:
      (zipWith op A )  .  (zipWith op B) . (zipWith op C)
In other words, you can do the list differences one by one. Prune by rows, then by columns, then prune by boxes.

We returned to discussing the 4D view of the board. BS and SAC suggested taking a more character-centred view of it: each character forms a 9-rooks problem (as opposed to a 9-queens problem) on the board. SAC suggested generating boards using possibilities for the rook positions of the characters, and then using repeated pruning and level 2 strategies.

We then all trooped off, not for lunch this time.

[1] Sudoku entry in Wikipedia

[2] Richard Bird, Shin-Cheng Mu, Countdown: A case study in origami programming, JFP 15(5):679-702, 2005

Friday, 7th October 2005
(Minutes by SC)

Present: RSB, IB, SC, CEM, JG

Initially we discussed what research/conferences/papers we had been involved with over the summer.

We decided that 1.30pm looked like the best time to meet on Fridays this Autumn.

---------------

RSB has a problem with an usherette in a cinema. No, not like that. An imaginary usherette.

This usherette helps people to get to their seats in the cinema. Imagine you have 5 seats in a row:

        _   _   _   _   _
You want to have people always seated in alphabetical order. First arrives Alice. Where to put Alice? Suppose you put her in position 1.
        1   _   _   _   _
Secondly arrives Abby. So you will have to move Alice, e.g. to
        _   2   _   1   _
This involves irritation for Alice, because she had to move.

Overall, we want to minimize the sum of the irritations as all five people sit down. The catch is that you don't know anything about the context, you have no idea where in the ordering the next person is likely to fit compared to the people already seated.

So the puzzle is: what's the smallest number of irritations that you can guarantee?

Various thoughts were pondered, such as the realisation that the optimality principle is not obviously true, and neither is a greedy strategy. It might be the case that irritating slightly more on one insertion allows you to irritate less on a later insertion.

By filling in people from the left-hand side, you can show that the number of irritations needed is at most O(N^2).

What about a lower bound? Well consider the last person added. If the gap is in the middle, the most irritations that can result must be ceiling((N-1)/2).

For the previous person's addition, arranging the gaps at distance (N-2)/4 from each end minimizes the worst irritation.

For the person before that, the gaps should be at distance (N-3)/6 from the ends, with distance (N-3)/3 between gaps.

So the lower bound is at least the sum over i of (N-i)/(2i), which is of order N log N.

RSB was pondering giving a departmental seminar, and said that he has an O(N log^3 N) upper bound.

Puzzle challenges for us:

For 5 seats in the cinema, show that you can GUARANTEE that no more than 6 irritations are necessary.

For 6 seats in the cinema, show that you can GUARANTEE that no more than 8 irritations are necessary.

RSB talked a little more about context of this problem. Really it is a labelling problem (an analogy would be putting books on bookshelves and keeping them ordered), with the space of labels exactly equal to the number of labels you will use.

A more general problem is that of maintaining order: You have a data structure where you want the following operations:

  insert(x,y) - insert x immediately after y in the data structure
  delete(x)
  query(x,y) - does x come before y in the data structure
If you have the labelling done a certain way then you can immediately tell the result of query(x,y) just by looking at the labels, so query is O(1).

We thought about various data structures where the sort operation (displaying the stored data in sorted order) is O(n) and the insert operation is more efficient, e.g. AVL trees have O(log n) insertion and O(n) sorted display (in-order traversal).

We then all trooped off for lunch.

Friday, 17th June 2005
(Minutes by JG)

Jamie Snape talked about his MSc project with Richard, which is to develop a loopless algorithm for enumerating combinations.

Richard's definition of a loopless algorithm is one in the form

  unfoldr step . prolog
where prolog takes time linear in the input, and step is (amortized?) constant time.

The problem in question is to enumerate combinations, the things counted by binomial coefficients. For example, the possible ways of choosing two items from the four a,b,c,d are:

  ab, ac, ad, bc, bd, cd
These can be represented as four-bit binary numbers:
  1100, 1010, 1001, 0110, 0101, 0011
Of course, we can't afford to output the binary numbers themselves, as outputting one takes more than constant time. So the idea is to put them in a particular order, such that the changes from one to the next in that order are small, and to output the changes instead.

The order Jamie has chosen (I didn't note the explanation why) is as follows. Denote by W'(s,t) the language (that is, set of strings) with s zeroes and t ones in some order, and by W(s,t) the same set without 1^t 0^s. Then:

  W(0,t) = {}
  W(s,0) = {}
  W(s,t) = W(s-1,t) 0 | W(s,t-1) 1 | 1^{t-1} 0^s 1

  W'(s,t) = 1^t 0^s | W(s,t)
[I think I have that right, but I omitted to take any notes.]

Here, {} is the empty language (the empty set of strings), | is union of languages (union of sets of strings), and juxtaposition is composition of languages (each string of the first concatenated with each string of the second).

For example, the third line says that to make a string of s zeroes and t ones (without having all the ones followed by all the zeroes), either take a similar string of s-1 zeroes and t ones and follow it with a 0, or take a similar string of s zeroes and t-1 ones and follow it with a 1, or take the single string of t-1 ones, s zeroes and a final 1.

That rather perverse characterization of the languages is chosen for a reason. Instead of thinking about sets of strings, now think about ordered-without-duplicates sequences of strings, and use the same recurrence. Now {} is the empty sequence, | is concatenation of sequences, and juxtaposition is cartesian product (note we always use it with one argument a singleton, so the ordering is obvious) of string concatenation. Thus:
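
A hedged transcription of that recurrence into Haskell (my function names), reading W(s,t) and W'(s,t) as sequences of strings; it reproduces the example values below:

     w :: Int -> Int -> [String]
     w 0 _ = []
     w _ 0 = []
     w s t = map (++ "0") (w (s-1) t)
          ++ map (++ "1") (w s (t-1))
          ++ [replicate (t-1) '1' ++ replicate s '0' ++ "1"]

     w' :: Int -> Int -> [String]
     w' s t = (replicate t '1' ++ replicate s '0') : w s t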

  W(1,1)  = [ 01 ]
  W(2,1)  = [ 010, 001 ]
  W(1,2)  = [ 011, 101 ]
  W(2,2)  = [ 0110, 1010, 0101, 0011, 1001 ]
  W'(2,2) = [ 1100, 0110, 1010, 0101, 0011, 1001 ]
The order of elements in W'(2,2) is the one to be used. Why this? It turns out that each can be generated from the previous one by a cyclic rotation right one place of an initial segment. For example, rotate all of 1100 right one place to get 0110, and rotate the first three elements of 0110 right one place to get 1010.

Alternatively, a cyclic rotation of the initial segment of length k one place to the right involves taking the element at position k (counting from 1) and moving it to the start. The values of k involved in this particular sequence are:

  1100 -4-> 0110 -3-> 1010 -4-> 0101 -3-> 0011 -4-> 1001
Now the challenge is to compute the sequence of k values, given the pair (s,t), as a loopless algorithm. Jamie has a recursive function ct' such that
  ct'(2,2) = [4,3,4,3,4]
(whose definition I leave to the enthusiastic reader). He knows that there is an imperative loopless algorithm for ct', but not yet how to make a purely functional one.

Friday, 3rd June 2005
(Minutes by JG)

Richard came with more on pattern-matching. He knows what the failure table for Boyer-Moore matching should be, and has a simple specification for it that amounts to a quadratic-time program. He has a nearly-always-linear program, but it's a long argument as to why the program is correct.

Friday, 6th May 2005
Present: RSB, SC (minuter), BS

RSB has been thinking about pattern matching again, revisiting a long-ago pattern-matching paper [1]. He wants to get from

     matches ws
to the Boyer-Moore algorithm [2].

Side thoughts: how do you test a pattern-matching algorithm? What kind of text do you use? How long a pattern? What about the efficiency of pattern-matching algorithms?

     matches :: Eq a => [a] -> [a] -> [Int]
     matches ws
   =
     map length . filter (ws _ends_) . inits

   = { Using
       map f . filter p = map fst . filter snd . map (fork (f,p))
       where fork (f,p) x = (f x,p x) }

     map fst . filter snd . map (fork (length, ws _ends_)) . inits

   = { Diverging a bit from the Knuth-Morris-Pratt [3] algorithm,
      as map (fork (f,g)) xs = zip (map f xs) (map g xs)  }

     map fst . filter snd . zip [0..] . map (ws _ends_) . inits

   = { ws _ends_ xs = (reverse ws) _begins_ (reverse xs)
       and so, writing sw for reverse ws  }

     map fst . filter snd . zip [0..] . map (sw _begins_) . map reverse . inits

   = { now introducing a scan, with flip (:)  }

     map fst . filter snd . zip [0..] . map (sw _begins_) . scanl (flip (:)) []

   = { Apparently KMP [3] generalises the "ws _ends_", but that's not
       where we were going. RSB didn't like the boolean, and wanted to
       generalise it a little, to an integer...

          xs _begins_ ys   =   (llcp xs ys = length xs)

       where llcp = length of longest common prefix
       and we will write m for length ws  }

     map fst . filter ((=m) . snd) . zip [0..] . map (llcp sw) . scanl (flip (:)) []
This is not a bad matcher, according to RSB, who tried it a few times on random examples. It's the naive matcher.
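
A runnable rendering of the matcher just derived, together with the obvious definition of llcp (an assumption; RSB's own code wasn't shown):

     llcp :: Eq a => [a] -> [a] -> Int
     llcp (x : xs) (y : ys) | x == y = 1 + llcp xs ys
     llcp _        _                 = 0

     matches :: Eq a => [a] -> [a] -> [Int]
     matches ws = map fst . filter ((== m) . snd) . zip [0 ..]
                . map (llcp sw) . scanl (flip (:)) []
       where m  = length ws
             sw = reverse ws

     -- matches "ab" "xabab" == [3,5]: the prefixes of lengths 3 and 5 end in "ab"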

But how to get to Boyer-Moore?

??

When trying to match a pattern in a text


          |----pattern-----|
   |--------------text---------------|
KMP looks at possible beginnings of the pattern in the text; Boyer-Moore looks at the end.

RSB then proceeded to explain some pre-processing you can do on the pattern to be matched, to yield a table of values. He asked for BS and SC to supply him with a pattern, any pattern, which we duly did, and then he completely rewrote it into the pattern:

     bbababbabb
Suppose you're looking for bbababbabb in a text, you check whether it matches from the right, and the last character doesn't match, like this:

             bbababbabb
     |----------------*---------------------------|
(* designates a non-match)

Then you'd have to next try matching the pattern two spaces further to the right (no use trying one space along because that's got another b there). This explains the first line of the table

   # chars    right
   matching   shift
       0        2
       1        1
       2        8
       3        8
       4        3
       5        8
       (etc.)
e.g. for two characters matching, the right shift would have to be 8
             bbababbabb
     |--------------*bb---------------------------|
Another example:
     abcacbaba

   # chars    right
   matching   shift
       0        1
       1        5
       2        2
       3        8
       (etc.)
Then RSB stated the Boyer-Moore algorithm
     boyer ws  = process m 0 . scanl (flip (:)) []
          where m      = length ws
                sw     = reverse ws
                shifts = table sw
                k      = last shifts


                process t n [] = []
                process t n (ns:nss)
                  | t==s      = n : process k (n+k) (drop (k-1) nss)
                  | s+p<m     = process m (n+p) (drop (p-1) nss)
                  | otherwise = process p (n+p) (drop (p-1) nss)
                    where p = shifts!!s
                          s = llcp sw (take t ns)
Table construction not given, but table :: [a] -> [Int].

As far as costs go, RSB wants it to be linear in the length of the text, and wants the table construction to be linear in the length of the pattern.

The cost of computing s is s steps. The cost of computing p is s steps.

RSB wants a proof that it's linear. He also wants to compute the table in linear time.

RSB also mentioned books by Crochemore Jewels of Stringology [4] and Gusfield (DNA matching etc.) [5], which describe these sorts of algorithms.

Another point about complexity: one might argue that the Boyer-Moore couldn't be described as sub-linear, even though you might not have to inspect all the characters, as you are still going to have to read all the characters in, anyway. RSB wondered if there was a way of making each '=' comparison expensive?

SC suggested an analysis of web logs. You might be looking at lists of URLs, trying to find certain patterns of behaviour (for example, to find out whether bots were accessing your web server in consistent patterns of URL accesses), and so each comparison would be inspecting URLs, which could be expensive as there are often many characters in a URL.

BS suggested matching words rather than characters, with shorter words more likely to occur than longer ones.

RSB then started muttering strange utterings: "I am thinking, thought thinking am I" "I am I am am I?" "Yes Yes Yes" Eventually SC and BS realised that RSB was trying to think of whether repetitive sequences of words could occur in written text. RSB was wondering what pattern matching algorithms text editors actually use, and whether you could deliberately slow down the find function of a text editor by deliberately giving it an awkward string in a long awkward text, say a string of DNA.

RSB suggested that we all troop off to lunch. So we did.

References:

[1] R. S. Bird, J. Gibbons and G. Jones: Formal Derivation of a Pattern Matching Algorithm. Science of Computer Programming, July 1989.

[2] R. S. Boyer and J. S. Moore: A fast string searching algorithm. Communications of the ACM 20:762-772, 1977.

[3] D. E. Knuth, J. H. Morris (Jr) and V. R. Pratt: Fast pattern matching in strings. SIAM Journal on Computing 6(1):323-350, 1977.

[4] Maxime Crochemore and Wojciech Rytter: Jewels of Stringology: Text Algorithms. ISBN 9810248970.

[5] Dan Gusfield: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. ISBN 0521585198.


Sharon Curtis (sharoncurtis@brookes.ac.uk)
