Coding For Food: September 2010

Sunday, 26 September 2010

Stepping Stone Algorithm in F#, Part 2

These are only a few points of information regarding the previous post.

Data Structures

The makeDict function creates a lookup table where the values are sequences. In a real-world setting it may be worth checking if it isn’t more efficient to make them lists. The modification of the function itself is easily made.

let makeDict keySelector valSelector lst =
    lst
    |> Seq.groupBy keySelector
    |> Seq.map (fun (key, valSeq) -> (key, valSeq |> Seq.map valSelector |> Seq.toList))
    |> dict

This requires all the subsequent calls to Seq.tryPick to be replaced by List.tryPick. Seq.tryPick will still work, but if there should be any efficiency gain in switching to lists, then there’s no point in still treating them as generic sequences.

Simplifying the Recursive Search Functions

There is only one version of makeDict, even if the keys and values to be extracted out of the initial solution list are quite different for rows and columns. This was achieved by passing some functionality into makeDict, and supplying different functions (or rather, a different order of the same functions) when creating each lookup table.

The two nested functions findInRow and findInCol within findPath also look quite similar, and they are really symmetric. Is there a way to simplify them in the same way and only have one single function? I have been trying around with this, and the best (only) solution I could come up with is the following.

let findPath ((targetRow, targetCol) as targetRowCol) lstStones =

    let rowColDict = [|lstStones |> makeDict fst snd; lstStones |> makeDict snd fst|]

    let rec findInRowCol selFix selMov ((currentRow, currentCol) as currentRowCol) =
        let dict = rowColDict.[selFix (0,1)]
        dict.[selFix currentRowCol] 
        |> Seq.tryPick (fun idx ->
            if idx=selMov currentRowCol then None
            else
                let cellRowCol = (selFix (currentRow, idx), selFix (idx, currentCol))
                if idx=selMov targetRowCol then Some([cellRowCol])
                else
                    match findInRowCol selMov selFix cellRowCol with
                    | Some(l) -> Some(cellRowCol::l)
                    | _ -> None)

    findInRowCol snd fst (targetRow, targetCol)

So yes, it is certainly possible, but I think this needs some more research to see if there is something that can be done to also make it aesthetically pleasing. I am not explaining the code here, be it a riddle.

Stepping Stone Algorithm in F#, Part 1

After inventing a new shinier sort of wheel and writing my own string class in C++, it was time to re-implement some OR algorithms, and I spent some time creating F# code for the Transportation Problem. The calculations are mostly routine and not very interesting, but at the heart of it all is a little mechanism that I found lends itself to be coded in a very succinct and elegant way, with a few standard library calls and little more; I owe to Microsoft and the F# team my very knowledge of the word succinct, and it is one of the three main characteristics of the language (the others being expressive and efficient, but I was familiar with those).

I may or may not show the complete code for the transportation algorithm on a later occasion, but for now I am just pulling this little mechanism out of its context and abstract it for isolated treatment: the Stepping Stone algorithm.

Recap

Very quickly, because it is not necessary in order to understand the code: In the transportation problem we have a number of suppliers S₁,..,S_n that each supply a certain quantity of something. And we have a number of destinations or sinks D₁,…,D_m that consume certain amounts of said something. m and n can and will be different. The something needs to be transported from the S_i to the D_j (i in 1…m, j in 1…n) and each S_i can deliver to more than one D_j. Equally, each D_j can receive from more than one S_i. The total amount of something delivered is the same as the total amount received. Delivering from a certain S_i to a D_j costs C_ij per unit of something.

The problem: Find out how much each S_i has to deliver to each D_j so that everyone is happy and the total cost is minimal.

The way this is done:

Find some solution that satisfies everyone, without necessarily having minimal cost. This is easy.
Check if this is optimal. This needs a bit of calculation.
If not optimal, find a new ‘satisfactory’ solution which is a bit better, by switching one supplier/destination S_i-D_j against another S_k-D_l. Then rinse and repeat.

Step 3 is the one that is of interest here.

Representation

The problem can be solved as a normal linear program using the simplex algorithm. However, the coefficient matrix for this has a very special form that allows for a more compact (or even succinct) representation, as a transportation tableau.

Fig. 1 Fig. 1

Each row in Fig. 1 represents all the deliveries of one supplier, and each column is what one particular sink consumes. Normally, this would be where we see the cost for each combination of the two, but this is not really needed at this point.

The orange cells represent a first solution to the problem, i.e. combinations of suppliers and destinations where everyone is satisfied. However, at this point the transportation algorithm may want to switch to a different solution by, say, having supplier S₃ deliver to destination D₁ (and you will have noticed how here in programming world we start counting from 0 and not from 1). This basis change happens by finding a path from the new (entering) cell through all the cells in the current solution and back to the new cell, with a few conditions. Fig. 2 shows the tableau with the new cell in blue.

Fig. 2 Fig. 2

Here is how that path needs to be constructed:

1. Start from one entering cell and find another cell in the same row or column that is in the solution (orange).

2. Continuing from the selected cell, find the next one within the solution (orange). If you moved within a row to find the current cell, now you need to check within that cell’s column; if you found it in the same column as the one before, now the next one must be found in the same row as the current one.

If in the process you reach the entering cell again, then you’re done. Otherwise keep repeating step 2. The path we needed to find for entering cell (3,1) is shown in Fig. 3.

Fig. 3 Fig. 3

The Mathematics here is not quite complete – we are not (yet) solving the transportation problem as such, don’t make use of the cost matrix, or of the supplier and sink capacities, and are only concentrating on a particular step of the algorithm. But this can be nicely isolated and formulated as a coding problem without all the additional information. Once this path finding operation is out of the way, the remaining calculations are rather trivial, and we might fill them in later.

The Coding Task

Given a list of solution cell co-ordinates of the form (row, column), and a single co-ordinate for the entering cell that can not be part of the solution, find a valid path from the entering cell back to itself through any number of solution cells, with the following restrictions:

Only move horizontally or vertically (no diagonal movements), and
After moving along a row, you have to move along a column, and vice versa.

Our input data looks as follows:

let solution = [(0,0);(0,1);(2,0);(2,2);(2,3);(1,2);(3,3)]
let entering = (3, 1)

Data Structures

While performing the algorithm, the code needs to search the tableau by row and by column, potentially many times, so the representation as a list is not ideal. It would require a linear search each time we need to pull out a row or a column. This could be made a bit better by sorting it first, but that could only work for either rows or columns. If, on the other hand, we create one big matrix or 2-dimensional array, the row and column splicing would be pretty easy and efficient – however, this would waste lots of space. In a real transportation problem, we have

n = number of suppliers = number of rows

m = number of destinations = number of columns

m+n-1 = number of cells in any solution

m*n = number of cells in the matrix

(m+n-1)/(m*n) = occupation of the matrix

The bigger the problem gets, the less space of the total matrix is actually needed.

4 suppliers, 4 destinations –> 43.75%

10 suppliers, 20 destinations –> 14.5%

100 suppliers, 200 destinations –> 1.495%

This is a case where at first glance a sparse matrix implementation would come in handy. However, they have their own issues, most importantly in this case the lack of efficient row and column splicing (they tend to have one of the two at most). We create our own problem-specific implementation:

One lookup table (dictionary) with a list of column indices for each row index
Another lookup table (dictionary) with a list of row indices for each column index

Creating the Dictionaries

What we have is

let solution = [(0,0);(0,1);(2,0);(2,2);(2,3);(1,2);(3,3)]

and, just considering one of the two cases, what we want is something that looks similar to

(* pseudo code *)
let (lookuptable: keyvaluetype) = [(0,[0;1]);(1,[2]);(2,[0;2;3]);(3,[3])]

Luckily, this can be done by piping the input list through a few standard library functions. First, let F# group the tuples of the input list by some key. The Seq.groupBy functions wants a function argument that calculates for each element what its key value is, and then does the grouping. Just looking at our first case (“One lookup table (dictionary) with a list of column indices for each row index”) the key to group by has to be the row index, so the function we provide to Seq.groupBy pulls the row index out of each tuple.

solution |> Seq.groupBy (fun (r,_) -> r)

And, remembering that such a function already exists in the library, this can be abbreviated to

solution |> Seq.groupBy fst

This creates a collection of all elements that correspond to each row index. Since the elements are the (row, column) tuples, what we have now is

(* pseudo code *)
[(0,[(0,0);(0,1)]);(1,[(1,2)]);(2,[(2,0);(2,2);(2,3)]);(3,[(3,3)])]

It is close to, but not quite, what we want. Seq.map is used to convert a sequence of some sort of elements into a sequence of some other sort, so this is what we need. Make the sequence of tuples where the second element is a list of index tuples into one where the second element is a list of indices.

(* ... *)
   |> Seq.map (fun (key, valSeq) ->
       (key, valSeq |> Seq.map (fun(_,c) -> c)))

We are using Seq.map twice, once to transform the ‘outer’ sequence, and then for the sequence that is contained in each of the tuples that are the elements. Again, there’s a library replacement for the function that pulls out the column indices and the code can be shortened to:

(* ... *)
   |> Seq.map (fun (key, valSeq) ->
       (key, valSeq |> Seq.map snd))

The new collection is

(* pseudo code *)
[(0,[0;1]);(1,[2]);(2,[0;2;3]);(3,[3])]

In the end we can use the dict function to create a lookup table out of this. A complete function that takes a solution list as in our input data and transforms it into a lookup table like those that we decided we need for the algorithm, could be implemented like this:

let makeDict lst =
   lst
   |> Seq.groupBy fst
   |> Seq.map (fun (key, valSeq) -> (key, valSeq |> Seq.map snd))
   |> dict

Now we need to do the same, but for the indices being on the columns. The function would look exactly the same, only for the roles of the calls to fst and snd swapped. A more generic version of the function is in order, where this information can be passed into instead of being hard coded:

(* The more general function *)
let makeDict keySelector valSelector lst =
   lst
   |> Seq.groupBy keySelector
   |> Seq.map (fun (key, valSeq) -> (key, valSeq |> Seq.map valSelector))
   |> dict

Path Finding

Now that we can look up any column for a row index and any row for a column index, it is pretty straightforward to take our entering cell and search for a path back.

let findPath (targetRow, targetCol) lstSolution =

   (*  Create the lookup tables for rows and columns.
       Both use makeDict but pass in the 'selector'
       functions in different order    *)
   let rowDict = lstSolution |> makeDict fst snd
   let colDict = lstSolution |> makeDict snd fst

   (*  From current position, find the row we're in,
       then go through all cells (except our current one).
       If we end up on the same column as the entering
       cell, aka 'target', then this is the last element
       of the list we are creating. Otherwise, check to
       see if we can move on in this column. If both fails,
       there's nothing there there.    *)
   let rec findInRow currentRow currentCol =
       rowDict.[currentRow]
       |> Seq.tryPick (fun col ->
           if col=currentCol then None
           else if col=targetCol then Some([(currentRow, col)])
           else
               match findInCol currentRow col with
               | Some(l) -> Some((currentRow, col)::l)
               | _ -> None)

   (*  Like findInRow, but looking through current column *)
   and findInCol currentRow currentCol =
       colDict.[currentCol]
       |> Seq.tryPick (fun row ->
           if row=currentRow then None
           else if row=targetRow then Some([(row, currentCol)])
           else
               match findInRow row currentCol with
               | Some(l) -> Some((row, currentCol)::l)
               | _ -> None)

   (*  Return value is created by calling one of the
       nested functions.   *)
   findInCol targetRow targetCol

This uses the Seq.tryPick function. Seq.tryPick iterates through a sequence and calls a specified function for each element. That function calculates a value from its argument and wraps it up in an option type value, or returns None. For None, Seq.tryPick then goes to the next element of the sequence, otherwise the returned option type value becomes the return value of Seq.tryPick. This is very similar to Seq.tryFind, except that Seq.tryFind uses a predicate function just to check if a sequence element is the one to be picked, and then wraps this element itself into an option type value; Seq.tryPick is a bit more flexible in that it allows to return something that is not the element value itself.

The thing we want to return here is a list with the elements of the found path. When we successfully arrive at the end of a valid path, the current cell becomes the end of that list, in

            else if col=targetCol then Some([(currentRow, col)])

            else if row=targetRow then Some([(row, currentCol)])

If the current cell is not the end of the path, but if a valid path is found ‘from here’, then it is prepended to the list that has been returned in the recursive foundInRow or foundInCol call.

                match findInRow row currentCol with
               | Some(l) -> Some((row, currentCol)::l)
               | _ -> None)

If this recursive call has not found anything, then the current cell is not part of the path, and returns None to inform its own caller accordingly. The top-level caller of this will either return an option type value with a list of cells that show the valid path, or None. It may be instructive to see how the algorithm finds its path by using backtracking: the way the recursive calls are made results in a trial-and-error movement along the cells in the current solution.

Backtracking

Fig. 4 shows how, starting from entering cell (3,1) in step [1], the algorithm looks through the current column and finds a cell in step [2], then looks in that row [3], then in the new column [4], then in the row and tries the first cell it finds there [5] and looks through that cell’s column [6]. The path traced so far is represented in the state of the code through the current call stack of the recursive calls to findInRow and findInCol. However, from the last cell that was found (1,2), there is no way forward.

Fig. 4

Fig. 5

Fig. 6

Step [6] did not lead to a result (see Fig. 5) and another try has to be made starting from [5]. However, there’s nothing else to try in that column, so step [5] is itself invalidated. The process has to re-start from [4], and as Fig 6 shows, new steps [5] and [6] are found that result in the required path back to the entering cell.

It can be instructive to see the backtracking process unfolding during program execution. The following is a modified version of the findPath function, that is functionally identical, but prints out its current activity.

let findPath (targetRow, targetCol) lstStones =

   let rowDict = lstStones |> makeDict fst snd
   let colDict = lstStones |> makeDict snd fst

   let rec findInRow tab currentRow currentCol =
       rowDict.[currentRow]
       |> Seq.tryPick (fun col ->
           if col=currentCol then None
           else
               printfn "%s%A" tab (currentRow, col)
               if col=targetCol then
                   printfn "%sFound." tab
                   Some([(currentRow, col)])
               else
                   match findInCol (tab + "..") currentRow col with
                   | Some(l) -> Some((currentRow, col)::l)
                   | _ -> None)

   and findInCol tab currentRow currentCol =
       colDict.[currentCol]
       |> Seq.tryPick (fun row ->
           if row=currentRow then None
           else
               printfn "%s%A" tab (row, currentCol)
               if row=targetRow then
                   printfn "%sFound." tab
                   Some([(row, currentCol)])
               else
                   match findInRow (tab + "..") row currentCol with
                   | Some(l) -> Some((row, currentCol)::l)
                   | _ -> None)

   findInCol "" targetRow targetCol

Example

Fig. 7 Fig. 7 Fig. 8 Fig. 8

The current solution as shown in Fig. 7 is quite realistic, for a small problem size, and it would look like this after 2 or 3 steps into the transportation algorithm. While it seems quite benign, we can see from Fig. 8 that the path belonging to an entering cell can become a bit complex. Here is the data and code to try out this example (assuming that the earlier mentioned functions are implemented somewhere.

let current = [(1, 3); (1, 0); (2, 0); (2, 1); (10, 1); (10, 5); (9, 5);
               (9, 4); (7, 4); (5, 3); (5, 2); (3, 2); (0, 0); (4, 2);
               (6, 3); (8, 4); (11, 5); (11, 6); (12, 6); (13, 6);
               (13, 7); (14, 7); (14, 8); (15, 8); (16, 8); (16, 9);
               (17, 9); (18, 9); (19, 9)]

let entering = (15, 3)

current |> findPath entering |> printfn "%A"

Remarks

This is only a part of the overall transportation algorithm, which I think merits some exploration on its own. I will post the whole thing when I have some more time. (Update 18-Jan-2011: This has now been done, see Transportation Algorithm in F#)
Instead of using the self-made lookup tables, this really shouts for the use of some standard sparse matrix implementation. This will be topic of a future post.
In the real transportation problem, if it is balanced and solutions are not degenerate, there exists one unique path for the entering cell, meaning that a path can always be found and that it doesn’t matter if you start looking in the row or column. In arbitrary examples this may be different, so you can have situations where there is no path, or where the one you find is only one possibility, and where there may be a difference between kicking this off with a call to findInRow or to findInCol.
There are a couple of alternative implementations for the findPath function, which I talk about here.

Monday, 20 September 2010

FsYacc: Parsing Variable-Length Lists, Part 2

Without writing any special F# code, and only relying on the parser, the results of the previous post can be achieved by making this change.

/*
PFunctionArgs:
    | PExpression                       { $1 }
    | PExpression COMMA PFunctionArgs   { ListExpression($1::[$3])  }
*/

PFunctionArgs:
    | PExpression                       { $1 }
    | PExpression COMMA PFunctionArgList   { ListExpression($1::$3)  }

PFunctionArgList:
    | PExpression                           { [$1] }
    | PExpression COMMA PFunctionArgList    { $1::$3  }

Of course, there are no interesting F# code samples to be had.

Sunday, 19 September 2010

FsYacc: Parsing Variable-Length Lists (and here’s a Recursive Active Pattern)

The previous post on parsing spreadsheet cells showed how to transform a text string with what is a typical spreadsheet cell into an internal representation in F# discriminated unions, so that it can be further processed by the code. There were quite a few little gaps left that are trivial to fill, like accommodating additional arithmetic operators like ^, or allowing for negation like –A2.

But there is also one huge gaping hole, which should become obvious when looking at a formula like

=SUM(A1,A2,A3,A4)

and which merits to be explored separately: argument lists. To be sure, the separator character can be either a comma or a semicolon, depending on the actual spreadsheet program and on language settings. Without loss of generality, we will be using commas. We also assume that the length of the list is unknown in advance, otherwise this would be a simple extension of our existing grammar rules. Also, lists of expressions6 is something that only ever happens in function calls (within our spreadsheet sample, that is). F# programs deal with data structures of variable length by using some sort of recursion, and this is where we start our first approach.

Starting the first try

The code from the previous post needs a few minor changes. First of all, the lexer needs to know about the existence of commas, and that they might be important to us; we want to see them as tokens. In the lexer.fsl file, that means adding a line to the rule we already have there:

| ","           { COMMA }

The identifier for the new token has to be defined, so the list of tokens in parser.fsy needs an additional line:

%token COMMA

The AST has to be able to express the concept of an argument list, or rather a list of expressions. We add this to the existing Expression DU in ast.fs:

(* ast.fs *)

(* ... *)
and Expression =
    (* ... *)
    | ListExpression of Expression list

(* ... *)

Now the only thing left to do is tell the parser how to recognise a list of expressions in the input token stream and make a ListExpression out of them. The current grammar composes a function out of two elements, a name and an expression.

    | FUNCTION LPAREN PExpression RPAREN    { Function($1,$3) }

After the introduction of argument lists, we have a special case for functions arguments, which is not covered by the grammar for generic expressions, as previously. We invent a new construct for whatever we put into a function argument. Generic expressions need to be one part of it (they haven’t gone away), and the other option are our new variable-length lists of expressions.

// parser.fsy

// ...
PExpression:
    // ...
    | FUNCTION LPAREN PFunctionArgs RPAREN    { Function($1,$3) }
// ...
PFunctionArgs:
    | PExpression                       { $1 }
    | PExpression COMMA PFunctionArgs   { ListExpression($1::[$3])  }

The new PFunctionArgs takes either any combination of tokens that can be interpreted as a PExpression and returns it unchanged; since this is used from within the PExpression case for function calls, it means that the token(s) just end up in the second argument of the Function DU constructor, i.e. it behaves exactly like before. The difference comes in the second case of PFunctionArgs. This finds any PExpression that is followed by a comma, and makes it the first element of a new list; after that, it takes everything after the comma, as long as it can also be interpreted as PFunctionArgs. If it can – and if it matches the first case of PFunctionArgs – we end up with a two-element list object, otherwise, the recursion can go on until some component of the token string is found that matches the first case.

This recursion happens in the parser code, not in the AST.

There is no way that the ListExpression constructor can enter into any sort of recursion. But the recursion within the parser results in a variable number of nested constructor calls.

The previous post had some test code to visualise the contents of the AST. The astwalker.fs file needs some addition in order to recognise the new ListExpression, but since this contains just a standard list type whose element types are already well known and accounted for, it is easy to add this case using standard List library functions.

(* astwalker.fs *)

(* ...*)
    and walkExpression x =
        match x with
        (* ...*)
        | ListExpression(l) -> 
            "[" + 
            List.fold 
                (fun acc e -> acc + "," + walkExpression e)
                (List.head l |> walkExpression) 
                (List.tail l)
            + "]"

(* There's a bit of ugliness going on here in order to
   get a correct comma count in the output string. *)

After making all the changes the whole project should still/again build successfully and we can try the test function with a simple formula.

"=SUM(A1,A2,A3,A4)" |> convert |> printfn "%A"

Surprises

The first, and maybe biggest, surprise is that this is all still works in a way and really can take apart a list of expressions and create some output for it. What we could have seen coming is that the output wouldn’t be exactly the simple list of expressions we were after. Rather we have

"(SUM,[A1,[A2,[A3,A4]]])"

Going back to the new bits of grammar in the parser file, we can try to understand events step by step. Let’s say we hit the grammar rule for PFunctionArgs with the following string.

A1,A2,A3,A4

The second case of PFunctionArgs knows that A2 has to be treated as a PExpression, and finds after some looking through the grammar tree that this is a CELLREF, hence wraps it up in a SingleCellRef.

ListExpression(SingleCellRef(A1)::[$3])

Now it needs something to do with the second value ($3).

A2,A3,A4

This turns out to match nicely with PFunctionArgs again, so it gets the same treatment.

ListExpression(SingleCellRef(A2)::[$3])

Then this is put back into the gap left in the first iteration, so by now the total expression looks like:

ListExpression(SingleCellRef(A1)::[ListExpression(SingleCellRef(A2)::[$3])])

And so on, until in the end the second argument ($3) is reduced to just the last element of the list, for which an appropriate PExpression is found, and then recursion ends. Writing out the whole result and applying the cons operators gives the result.

ListExpression([SingleCellRef(A1);ListExpression([SingleCellRef(A2);ListExpression([SingleCellRef(A3);SingleCellRef(A4)])])])

This is, in fact, not a data structure that in itself poses too many problems. But in the context of a larger AST, and when we imagine that each element of the list can itself contain further lists and complex expressions, and mainly because what we wanted initially was an argument list (and not an argument-linked-list-of-lists-sort-of-thing), we definitely want to flatten this.

Flattening while traversing (recursive Active Pattern here)

The flattening logic can be built into the code that traverses the list; in our sample that would be the functions in astwalker.fs. This is where Active Patterns come in. Active Patterns allow to match against arbitrarily complex data structures, because they allow the normal pattern matching mechanism to be replaced by custom code. Active Pattern definitions are, in fact, functions, so inside them we can do more or less everything we can do in any other function, including recursion.

This Active Pattern contains the code to match the data structure we have built up so far and ‘match’ it against a flat list of Expression DUs.

let rec (|FlatListExpression|) e =
    match e with
    | ListExpression([eh; FlatListExpression(et)]) -> eh::et
    | _ -> [e]

(*
    The value e is either a ListExpression that contains
    a list whose first element is an Expression and second
    is another ListExpression that also needs flattening;
    or it is a plain Expression element.
    We know there are no other cases, so we can
    use a catch-all pattern for the second case.
*)

In the first match case within the Active Pattern definition, we pull the first list element out of the ListExpression and cons it with the list that is in the second list element, which in turn is another ListExpression, and hence may need flattening. For this, we use the Active Pattern we are just defining. This is truly awesome.

The only little glitch is that in order to use it in astwalker.fs we need to add it to the existing list of mutually recursive functions by defining it as

    and (|FlatListExpression|) e =
        match e with
        | ListExpression([eh; FlatListExpression(et)]) -> eh::et
        | _ -> [e]

This definition takes account of mutuality (the Active Pattern needs ListExpression and the walk functions will need the Active Pattern) and of recursion. The new pattern can now be used at the same location where we previously matched against ListExpression, and the result is that the ‘match’ found will be with the flattened list.

    and walkExpression x =
        match x with
        (* ... *)
        //| ListExpression(l) -> 
        | FlatListExpression(l) -> 
            "[" + 
            List.fold 
                (fun acc e -> acc + "," + walkExpression e)
                (List.head l |> walkExpression) 
                (List.tail l)
            + "]"

Going back to the main program module and running the test function again

"=SUM(A1,A2,A3,A4)" |> convert |> printfn "%A"

we get

"(SUM,[A1,A2,A3,A4])"

Flattening while parsing

Despite of the blinding awesomeness of the whole Active Pattern mechanism, it may not be the preferred solution. The main concern is here that the flattening has to be performed every single time the AST is traversed. It is, of course, thinkable to only do this once and write back the result – or rather, to create a new data structure with the unflattened lists being replaced. However, it may be easier to do the whole process while parsing.

Now, Active Patterns work very well when reading existing data structures, but they don’t seem to help all that much when building them up from scratch. What we are looking for is something that can replace the ListExpression constructor, take the same input (this comes from the parser code), and ends up with a flat ListExpression that can be fed into the Function constructor (as per our grammar).

let MakeListExpression l =
    match l with
    | h::[ListExpression(t)] -> ListExpression(h::t)
    | _ -> ListExpression(l)

The MakeListExpression function is expected to be called for the first time (in the innermost nested call, see the nesting for ListExpression above) with a list of Expression expressions. This is the second match case, and we just wrap the list in a ListExpression object.

In subsequent calls we get a list with the first element being an Expression, and the second one of the ListExpression expressions that were created in previous calls to the function. We unwrap the list that we previously had wrapped into a ListExpression, cons the first element, and wrap it all up again for the next step. When the process stops at any given point, we always end up with a ListExpression that contains a flat list with all Expression expressions we have processed.

The parser file so far only uses DU constructors, but there is nothing to keep it from calling just any F# function from our code, as long as all the F# expressions we ask it to produce fit together according to the grammar rules. In this case we can just replace the ListExpression constructor with a call to MakeListExpressions.

PFunctionArgs:
    | PExpression                       { $1 }
    | PExpression COMMA PFunctionArgs   { MakeListExpression($1::[$3])  }

For completeness, we also need to put the function somewhere the parser can use it. The ast.fs file comes to mind, and is in fact a good place. Out of the box, this is started with a namespace declaration, though, and cannot contain any values/functions. So other than adding the function to the file, we also need to change

namespace Ast

module Ast

Remarks

This also with nested calls, try parsing "=SUM(A1,A2,AVERAGE(B1,B2,B3,B4),A4)" or similar.

Showing the first solution and then discarding it seems like a waste of time. However, there may be reasons in a particular project to stick with it. And it shows a nice example of an Active Pattern making sense of a complex data structure.

There is at least one other solution. I preferred something that could be done in F# code, because it is easier to understand and debug, and leaves the lexer and parser files short and clean.

In the code that produces the string output, I decided to write out Expression lists within square brackets “[exp1, exp2, exp3]”. It would probably be more in line with the rest of the code to use a more Lisp like style, e.g. “(ARGLIST, exp1,exp2,exp3)”. However, the text output is only for visualisation and doesn’t affect the internal data structures anyway, and I like it more this way.

Friday, 17 September 2010

Lispifying Excel Cells with FsLex/FsYacc

For a little while on the train this afternoon I was convinced I had coined a new expression. Back at my computer, though, the Internet informed me that people have been “lispifying” things since before I was born, and I just hadn’t been in on the joke. Nevertheless, it’s a good word, and I will proceed to do just that, lispify something.

There was a question on Stackoverflow a couple of days ago as to how it would be possible in F# to take the content of an Excel spreadsheet cell and interpret it to find out if it was a simple value or a formula, and in the latter case, to return the operation and arguments. The poster of the question was after something that would return either the value or a tuple with the details of the formula, for the exact request see here.

This struck me as a good opportunity to get my hands on the F# lexer and parser. As an example it seems just simple enough not to get completely lost in all sort of arcane specialist tweaks that only hardcore compiler builders know about, and yet it offers some real insight into how the whole process works.

First things first

We need some tools. As per the current version of F#, the parser/lexer binaries are not in the main distribution, but in the F# PowerPack, so this needs to be downloaded and installed. I have used Visual Studio 2010 for this, but 2008 might just work. And in order to get all the bits and pieces together quickly, we are using the F# Parsed Language Starter template for Visual Studio.

There is some documentation out there on how to use FsLex and FsYacc. Robert Pickering gives a good introduction in Foundations of F#; A bit more technical is the chapter in Expert F# (Don Syme, Adam Granicz, Antonio Cisternino), which also touches on potential issues with parsing, conflicts, operator precedence, and some more. I am guessing that if you read this then you know both books anyway. There is also the Wikibook, which I like very much. Read through all this before continuing.

Getting started

Open a new project in Visual Studio. Find the F# project templates in the list and select the Parsed Language Starter.

This creates a project to parse instructions for a little calculator application. We won’t be needing this, but it is a rather nice way to get all the necessary files created and make sure that the lexer and parser are called during the build. Currently neither is officially integrated into Visual Studio, which means that the only way to use both programs is either a separate command line window or a pre-build step in the project settings; both is a bit clumsy and doesn’t do much for keeping track of build dependencies either.

The files in the new project are

lexer.fsl, which has the syntax description that is needed to correctly tokenise the input strings.
parser.fsy, which describes our grammar and is used to make sense of the tokens that the lexer creates.
ast.fs, with definitions of a number of discriminated unions that make up the abstract syntax tree, the internal representation of the lexed and parsed input, i.e. the data structures that our program works with.
program.fs, from where the lexer and parser are called, fed with input, and where the rest of our program logic can go in.

Lispification

In terms of a brief specification, here is what the program is supposed to do:

1. Take a string in the form of a spreadsheet cell. I am using OpenOffice Calc where I am sitting now, and Excel on other occasions, and the basic format is pretty much the same. You have either

a number, or
a formula, starting with ‘=’

In the case of a formula, you can have numbers again, or references to other cells, in the form “A1”, “B2”, etc. and arithmetic combinations thereof, using the normal arithmetic operations. There can also be calls to built-in functions; the syntax for these is the function name, followed by a pair of parentheses with argument values inside. Valid cell content looks like

1002

=SUM(B2:B12)

=A1+AVERAGE(B2:B4)

2. Convert this into an internal representation that our program can work with. I should mention at this point that we can’t just create tuples; depending on the formula that is parsed, the result can have arbitrary length and depth, and tuples are not good for this. What we will get is a tree of discriminated unions, which are easy enough to traverse and use in F# anyway.

3. Once we have an internal representation, also create output strings, merely for visual pleasure. Output strings will look like comma-separated lists enclosed in parenthesis. The first element of such a list is the operation to perform (if any), and the rest are the arguments that it is performed on. Lists can be nested if the formulas are more complicated. The original cells above would be output as

(1002)

(SUM,(RANGE,B2,B12))

(+,A1,(AVERAGE,(RANGE,B2,B4)))

Trick question

Where does the term “lispify” come from?

The lexer file

The lexer.fsl file in the new F# project contains a header, enclosed in curly brackets {..}, a number of regular expression definitions and a single tokenisation rule. The header bit is used to pull in necessary definitions from other modules and we leave it intact. The regular expressions are used to match against bits of the input strings, and the rule instructs the lexer to emit tokens for every match that is found.

The token themselves are defined in the parser file, so both really mutually depend on each other. Some of the tokens have types associated with them, and are assigned a value whenever a corresponding match is found.

Where for instance the parser file contains a definition

%token <System.Double> FLOAT

any lexer rule that creates this token needs to assign it a value of type double (usually the bit of the input string that the rule has just been matched against):

| ['-']?digit+('.'digit+)?(['e''E']digit+)? { FLOAT (Double.Parse(lexeme lexbuf)) }

Many of the regular expressions and most of the rule are also used for our program, but just to avoid confusion, let’s remove all of them (leave intact the {...} header, though!), and replace them with the following bits:

(* Regular Expressions *)
let digit = ['0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
let cellref = ['A'-'Z']+digit+ // matches a cell reference, like "A2"
let function = ['A'-'Z']+   // matches a function name

and

(* Tokenisation Rule *)
rule tokenize = parse
| whitespace    { tokenize lexbuf }
| newline       { tokenize lexbuf }
| "+"            { PLUS }
| "-"            { MINUS }
| "*"            { ASTER }
| "/"            { SLASH }
| "("            { LPAREN }
| ")"            { RPAREN }
| ['-']?digit+                                    { INT32 (Int32.Parse(lexeme lexbuf)) }
| ['-']?digit+('.'digit+)?(['e''E']digit+)?        { FLOAT (Double.Parse(lexeme lexbuf)) }
| eof           { EOF }
| cellref       { CELLREF(lexeme lexbuf) }  // reference to another cell, like A3, G6, etc.
| "="           { FORMULA }
| function      { FUNCTION(lexeme lexbuf) } // a function in the spreadsheet cell
| ":"           { RANGESEP }                // separates two cell references to make a range

The parser file

The parser.fsy file is where everything comes together. The parser needs to know about the tokens that the lexer is to produce, and in fact, all those tokens are defined in this file. It also defines the grammar (== in which combination and order does our input make any sense, and what sense exactly?).

And depending on the tokens passed on from the lexer and how they are combined, the parser emits F# code into our program. The only code that we will produce here are discriminated union constructors, but in principle we could do other things. However, all we want to do now is set up a data structure, so that’s good enough.

The parser file can get a bit confusing, because it mixes all these different things, and you are working with a lot of identifiers that come out of three different name spaces:

token identifiers, that are defined here but used first by the lexer,
F# identifiers from the code templates, in our case the discriminated union constructor identifiers, and
the identifiers that make up the parser’s grammar.

It doesn’t really help that they normally represent related concepts and have very similar names. In the sample code, I have tried to make the distinction more obvious: all token identifiers are all uppercase (FORMULA), all names used in the grammar start with an uppercase ‘P’ (PExpression), and the rest are symbols from the F# side.

In the same way as with the lexer, we don’t take any prisoners here and get rid of all the token and grammar bits, and replace them with our own (leave the header %{..%}, though):

%start start

%token <System.Int32> INT32
%token <System.Double> FLOAT
%token PLUS MINUS ASTER    SLASH
%token LPAREN RPAREN
%token EOF
%token <System.String> CELLREF
%token FORMULA
%token <System.String> FUNCTION
%token RANGESEP

%left PLUS MINUS
%left ASTER SLASH

%type < Ast.Cell > start

%%

and

start: PCell { $1 }

PCell:
    | PConstant                         { ConstantCell $1       }
    | FORMULA PExpression               { FormulaCell $2        }

PExpression:
    | PConstant                         { Constant $1           }
    | PExpression ASTER PExpression     { InfixOperation(Times($1, $3))     }
    | PExpression SLASH PExpression     { InfixOperation(Divide($1, $3))    }
    | PExpression PLUS PExpression      { InfixOperation(Plus($1, $3))      }
    | PExpression MINUS PExpression     { InfixOperation(Minus($1, $3))     }
    | FUNCTION LPAREN PExpression RPAREN    { Function($1,$3) }
    | PCellref                          { CellRef($1)           }

PConstant:
    | INT32                             { Integer($1)           }
    | FLOAT                             { Float($1)             }

PCellref:
    | CELLREF                           { SingleCellRef $1      }
    | CELLREF RANGESEP CELLREF          { RangeCellRef($1, $3)  }

A few remarks: The %start token is where the root of our data structure will be, it is named ‘start’ (appropriately) and the associated type from our F# code is Ast.Cell

%type < Ast.Cell > start

This means that the parser will create F# code that creates an Ast.Cell from scratch, and elements of all the other data types depend on that. In a bit we will be able to see that this is indeed the case. From the parser’s point of view, the top level in the evaluation is ‘PCell’ (which is everything), which consists of elements ‘PConstant’ or ‘FORMULA PExpression’ etc. Where token names are used they match directly with the tokens from the lexer, for other identifiers there is another case further down the hierarchy (‘PConstant’ e.g. has either INT32 or FLOAT), but eventually every path from ‘PCell’ must always end in one or more lexer tokens.

For each element of the grammar that could be matched, the F# expression in curly brackets is emitted into the code. We can see that these all construct a hierarchy of nested discriminated unions.

The abstract syntax tree

The AST is in Ast.fs. This is a very theoretical term, what we have here in fact are type definitions for all the discriminated unions that eventually will make up our data structure. They need to be able to fit into each other in a way that doesn’t conflict with what we told the parser about our grammar, and one of them needs to be able to stand alone, but other than that, this bit is pretty straightforward.

Again, we remove all the type definitions from the existing file, and replace with our own:

type Cell =
    | FormulaCell of Expression
    | ConstantCell of Value

and Expression =
    | CellRef of Reference
    | Function of String * Expression
    | Constant of Value
    | InfixOperation of BinaryOperator
    
and BinaryOperator =
    | Plus of Expression * Expression
    | Minus of Expression * Expression
    | Times of Expression * Expression
    | Divide of Expression * Expression

and Value =
    | Float   of Double
    | Integer of Int32

and Reference =
    | SingleCellRef of String
    | RangeCellRef of String * String

Done (well, almost)

There is some code in the program.fs file, which may or may not work with our new lexer and parser, but I have just removed it and left the two most relevant lines

let cell = "=SUM(B2:B12)"
let lexbuff = LexBuffer<char>.FromString(cell)
let parsed = Parser.start Lexer.tokenize lexbuff

The project should now build ok (try!) and if you run it in the debugger until this point, you should be able to see the whole data structure in the variable ‘parsed’. For programmatic evaluation of the Excel formula, this should be sufficient. The type of the result is the type of the top most discriminated union that we have defined for this, and it is not very difficult to traverse the whole tree using pattern matches.

Done (really)

Here is the code that traverses the whole tree and creates the promised text output. I have added a file astwalker.fs to the project and put it in there. If you do the same, please remember that the two source files (program.fs and astwalker.fs) now need module names (because they are more than one), and you need to move files in the project so that program.fs remains the last file.

let rec walkCell x =
    match x with
    | ConstantCell(v) -> walkValue v
    | FormulaCell(f) -> walkExpression f

and walkExpression x =
    match x with
    | InfixOperation(b) -> walkBinaryOperator b
    | Constant(c) -> walkValue c
    | Function(f,arg) -> walkFunction f arg
    | CellRef(r) -> walkRef r

and walkBinaryOperator x =
    let extractOp = function
        | Plus(a, b) -> (a, b, "+")
        | Minus(a, b) -> (a, b, "-")
        | Times(a, b) -> (a, b, "*")
        | Divide(a, b) -> (a, b, "/")
    let (var1, var2, op) = extractOp x
    "(" + op + "," + walkExpression var1 
        + "," + walkExpression var2 + ")"

and walkValue x =
    match x with
    | Float(f) -> f.ToString()
    | Integer(i) -> i.ToString()

and walkFunction function_name arg =
    "(" + function_name + ","
    + walkExpression arg + ")"
    
and walkRef x =
    match x with
    | SingleCellRef(r) -> r
    | RangeCellRef(r1, r2) 
        -> "(RANGE," + r1 + "," + r2 + ")"

If we go back to program.fs, we can create a little function that calls all three components

let convert cell =
    let lexbuff = LexBuffer<char>.FromString(cell)
    let parsed = Parser.start Lexer.tokenize lexbuff
    AstWalker.walkCell parsed

and then call it for a bunch of test input strings

let teststrings = [ "1002";
                    "=SUM(B2:B12)";
                    "=A1+AVERAGE(B2:B4)";
                    "=B3+G7";
                    "=MIN(MEAN(A3:A20))";
                    "=A1+B4*C6"
                    ];

teststrings |> List.iter (fun s -> (convert s |> printfn "%A\t%A" s))

The output should look like

“1002” “1002”

"=SUM(B2:B12)" "(SUM,(RANGE,B2,B12))"

"=A1+AVERAGE(B2:B4)" "(+,A1,(AVERAGE,(RANGE,B2,B4)))"

"=B3+G7" "(+,B3,G7)"

"=MIN(MEAN(A3:A20))" "(MIN,(MEAN,(RANGE,A3,A20)))"

"=A1+B4*C6" "(+,A1,(*,B4,C6))"

Limitations: There’s no proper error checking, any input you give it that it doesn’t like will make it crash. Please modify and extend at your leisure.

Sunday, 12 September 2010

Apple, Google, and Videos

I really had to try the new Google Scribe in the form of a letter from the list below and simply copy & paste mentioned HTML code into your web page to display the button above to your Bookmarks.

Well yes, as the above shows, it helps greatly with writing long sentences, if you have to fill space and a deadline to hit and your own 2000-words programmer vocabulary doesn’t quite cut it. But then, even The Independent has now had a go at randomly creating sentences, so this has ceased to be funny, or original, or witty pretty much the moment it was born. I won’t do it again then.

What the Internet apparently hasn’t quite caught onto yet is that predictive writing in the Scribe way is not a new idea at all, and was already available on the MacBook Wheel:

Apple Introduces Revolutionary New Laptop With No Keyboard

On a different note, I missed out on Don Syme’s talk at the London F# User Group on Thursday. I was tied up typing some code, it got late… a day in the life. Luckily they put the whole talk plus presentation slides on the Internet, so one can catch up.

This is a very good overview of what the language is about, how it fits into the bigger .NET picture, and some possible applications; the emphasis is a bit on finance, which is natural given the location. I guess if you come across a fellow programmer who has heard of F# and doesn’t really see why to bother, then this is probably a lost cause anyway – but if you want to try doing some persuasion, this talk is a good starting point.

The organisers can’t be praised enough for setting up the user group, and so far they have been able to get an impressive line up of speakers at these events, so if you’re anywhere in flying distance of London you should seriously think about signing up.

I just had a thought that F# object expressions look like a good way to create a sort of runtime configurable factory pattern, if that makes sense, sort of, in a way. Might be worth looking for prior art and/or trying around myself.

Wednesday, 1 September 2010

F# Infinite Sequences

Has anyone else noticed how the Fibonacci example in the documentation for sequences is not quite top notch? First of all, it starts with 2 3 5 8 … while of course this should be 1 1 2 3 5 8 …; and then, I cannot really see any good reason why this sequence should be finite, and stop at 1000, of all places. Here is what is in the example as it stands:

let fib = Seq.unfold (fun state ->
    if (snd state > 1000) then None
    else Some(fst state + snd state, (snd state, fst state + snd state))) (1,1)

The problem with the correct starting point appears to be down to the necessary seeding of the state value. If the (1, 1) tupel is used as initial value, it cannot also be displayed as a result. The following code uses an option type for the state parameter, so that it can signify that the value is not initialised in the first call. It also returns an infinite sequence, which seems more natural.

let fib =
    let unfoldfn state =
        match state with
        | Some((preprev, prev)) -> 
            let next = preprev + prev
            Some(next, Some(prev, next))
        | None -> Some(1, Some(0, 1))
    Seq.unfold unfoldfn None

One needs to select a subset of the elements, e.g. the first n with Seq.take, to work with the elements or print them out, but the sequence will lazily evaluate an infinite number of them when requested.

printf "%A" (Seq.take 20 fib)

I only stumbled over this example because I was looking to figure out the correct way to use sequence expressions and the yield keyword to create infinite sequences, for a little algorithm I was thinking about. It turns out that Seq.unfold works pretty well, but more complex cases may be cumbersome to put into the generator function. The idea in the example is to take a list (finite as they come) and convert it into an infinite data structure (a sequence), such that the sequence returns one element of the list after the other, and after reaching the last element starts again from the first. A list like [1;2;3;4;5;6;7] would be returned as an infinite sequence [1;2;3;4;5;6;7;1;2;3;4;5;6;7;1;2;3;…]. Let us call this a round robin sequence.

Iterating over a list, biting of its head and then doing the tail recursively, is a pretty standard use case in functional programming. What can be a bit confusing is the way this fits together with a sequence expression, but the confusion subsides after trying a few permutations of the code. There are, in the end, only a handful of things to keep in mind.

Always keep a copy of the original list, we need to get back to it after one iteration,
Return a single element with the yield keyword,
When doing the recursion use yield! instead of yield to add the result of the nested sequence expression.
Make sure to be prepared when some madman passes an empty list in.

After that, the round robin code is rather self explanatory.

let round_robin lst =
    let rec inner_rr l =
        seq {
            match l with
            | [] -> 
                yield! inner_rr lst
            | h::t ->
                yield h
                yield! inner_rr t
            }
    if lst.IsEmpty then Seq.empty else inner_rr []


printf "%A" (Seq.toList (Seq.take 20 (round_robin [1;2;3;4;5;6;7])))

“1002”	“1002”
"=SUM(B2:B12)"	"(SUM,(RANGE,B2,B12))"
"=A1+AVERAGE(B2:B4)"	"(+,A1,(AVERAGE,(RANGE,B2,B4)))"
"=B3+G7"	"(+,B3,G7)"
"=MIN(MEAN(A3:A20))"	"(MIN,(MEAN,(RANGE,A3,A20)))"
"=A1+B4*C6"	"(+,A1,(*,B4,C6))"

Sunday, 26 September 2010

Data Structures

Simplifying the Recursive Search Functions

Recap

Representation

The Coding Task

Data Structures

Creating the Dictionaries

Path Finding

Backtracking

Example

Remarks

Monday, 20 September 2010

Sunday, 19 September 2010

Starting the first try

Surprises

Flattening while traversing (recursive Active Pattern here)

Flattening while parsing

Remarks

Friday, 17 September 2010

First things first

Getting started

Lispification

Trick question

The lexer file

The parser file

The abstract syntax tree

Done (well, almost)

Done (really)

Sunday, 12 September 2010

Wednesday, 1 September 2010

Feedburner

Blog Archive

About Me

Alexander's shared items

Labels