The previous post on parsing spreadsheet cells showed how to transform a text string with what is a typical spreadsheet cell into an internal representation in F# discriminated unions, so that it can be further processed by the code. There were quite a few little gaps left that are trivial to fill, like accommodating additional arithmetic operators like ^, or allowing for negation like –A2.
But there is also one huge gaping hole, which should become obvious when looking at a formula like
=SUM(A1,A2,A3,A4)
and which merits to be explored separately: argument lists. To be sure, the separator character can be either a comma or a semicolon, depending on the actual spreadsheet program and on language settings. Without loss of generality, we will be using commas. We also assume that the length of the list is unknown in advance, otherwise this would be a simple extension of our existing grammar rules. Also, lists of expressions6 is something that only ever happens in function calls (within our spreadsheet sample, that is). F# programs deal with data structures of variable length by using some sort of recursion, and this is where we start our first approach.
Starting the first try
The code from the previous post needs a few minor changes. First of all, the lexer needs to know about the existence of commas, and that they might be important to us; we want to see them as tokens. In the lexer.fsl file, that means adding a line to the rule we already have there:
| "," { COMMA }
The identifier for the new token has to be defined, so the list of tokens in parser.fsy needs an additional line:
%token COMMA
The AST has to be able to express the concept of an argument list, or rather a list of expressions. We add this to the existing Expression DU in ast.fs:
(* ast.fs *) (* ... *) and Expression = (* ... *) | ListExpression of Expression list (* ... *)
Now the only thing left to do is tell the parser how to recognise a list of expressions in the input token stream and make a ListExpression out of them. The current grammar composes a function out of two elements, a name and an expression.
| FUNCTION LPAREN PExpression RPAREN { Function($1,$3) }
After the introduction of argument lists, we have a special case for functions arguments, which is not covered by the grammar for generic expressions, as previously. We invent a new construct for whatever we put into a function argument. Generic expressions need to be one part of it (they haven’t gone away), and the other option are our new variable-length lists of expressions.
// parser.fsy // ... PExpression: // ... | FUNCTION LPAREN PFunctionArgs RPAREN { Function($1,$3) } // ... PFunctionArgs: | PExpression { $1 } | PExpression COMMA PFunctionArgs { ListExpression($1::[$3]) }
The new PFunctionArgs takes either any combination of tokens that can be interpreted as a PExpression and returns it unchanged; since this is used from within the PExpression case for function calls, it means that the token(s) just end up in the second argument of the Function DU constructor, i.e. it behaves exactly like before. The difference comes in the second case of PFunctionArgs. This finds any PExpression that is followed by a comma, and makes it the first element of a new list; after that, it takes everything after the comma, as long as it can also be interpreted as PFunctionArgs. If it can – and if it matches the first case of PFunctionArgs – we end up with a two-element list object, otherwise, the recursion can go on until some component of the token string is found that matches the first case.
This recursion happens in the parser code, not in the AST.
There is no way that the ListExpression constructor can enter into any sort of recursion. But the recursion within the parser results in a variable number of nested constructor calls.
The previous post had some test code to visualise the contents of the AST. The astwalker.fs file needs some addition in order to recognise the new ListExpression, but since this contains just a standard list type whose element types are already well known and accounted for, it is easy to add this case using standard List library functions.
(* astwalker.fs *) (* ...*) and walkExpression x = match x with (* ...*) | ListExpression(l) -> "[" + List.fold (fun acc e -> acc + "," + walkExpression e) (List.head l |> walkExpression) (List.tail l) + "]" (* There's a bit of ugliness going on here in order to get a correct comma count in the output string. *)
After making all the changes the whole project should still/again build successfully and we can try the test function with a simple formula.
"=SUM(A1,A2,A3,A4)" |> convert |> printfn "%A"
Surprises
The first, and maybe biggest, surprise is that this is all still works in a way and really can take apart a list of expressions and create some output for it. What we could have seen coming is that the output wouldn’t be exactly the simple list of expressions we were after. Rather we have
"(SUM,[A1,[A2,[A3,A4]]])"
Going back to the new bits of grammar in the parser file, we can try to understand events step by step. Let’s say we hit the grammar rule for PFunctionArgs with the following string.
A1,A2,A3,A4
The second case of PFunctionArgs knows that A2 has to be treated as a PExpression, and finds after some looking through the grammar tree that this is a CELLREF, hence wraps it up in a SingleCellRef.
ListExpression(SingleCellRef(A1)::[$3])
Now it needs something to do with the second value ($3).
A2,A3,A4
This turns out to match nicely with PFunctionArgs again, so it gets the same treatment.
ListExpression(SingleCellRef(A2)::[$3])
Then this is put back into the gap left in the first iteration, so by now the total expression looks like:
ListExpression(SingleCellRef(A1)::[ListExpression(SingleCellRef(A2)::[$3])])
And so on, until in the end the second argument ($3) is reduced to just the last element of the list, for which an appropriate PExpression is found, and then recursion ends. Writing out the whole result and applying the cons operators gives the result.
ListExpression([SingleCellRef(A1);ListExpression([SingleCellRef(A2);ListExpression([SingleCellRef(A3);SingleCellRef(A4)])])])
This is, in fact, not a data structure that in itself poses too many problems. But in the context of a larger AST, and when we imagine that each element of the list can itself contain further lists and complex expressions, and mainly because what we wanted initially was an argument list (and not an argument-linked-list-of-lists-sort-of-thing), we definitely want to flatten this.
Flattening while traversing (recursive Active Pattern here)
The flattening logic can be built into the code that traverses the list; in our sample that would be the functions in astwalker.fs. This is where Active Patterns come in. Active Patterns allow to match against arbitrarily complex data structures, because they allow the normal pattern matching mechanism to be replaced by custom code. Active Pattern definitions are, in fact, functions, so inside them we can do more or less everything we can do in any other function, including recursion.
This Active Pattern contains the code to match the data structure we have built up so far and ‘match’ it against a flat list of Expression DUs.
let rec (|FlatListExpression|) e = match e with | ListExpression([eh; FlatListExpression(et)]) -> eh::et | _ -> [e] (* The value e is either a ListExpression that contains a list whose first element is an Expression and second is another ListExpression that also needs flattening; or it is a plain Expression element. We know there are no other cases, so we can use a catch-all pattern for the second case. *)
In the first match case within the Active Pattern definition, we pull the first list element out of the ListExpression and cons it with the list that is in the second list element, which in turn is another ListExpression, and hence may need flattening. For this, we use the Active Pattern we are just defining. This is truly awesome.
The only little glitch is that in order to use it in astwalker.fs we need to add it to the existing list of mutually recursive functions by defining it as
and (|FlatListExpression|) e = match e with | ListExpression([eh; FlatListExpression(et)]) -> eh::et | _ -> [e]
This definition takes account of mutuality (the Active Pattern needs ListExpression and the walk functions will need the Active Pattern) and of recursion. The new pattern can now be used at the same location where we previously matched against ListExpression, and the result is that the ‘match’ found will be with the flattened list.
and walkExpression x = match x with (* ... *) //| ListExpression(l) -> | FlatListExpression(l) -> "[" + List.fold (fun acc e -> acc + "," + walkExpression e) (List.head l |> walkExpression) (List.tail l) + "]"Going back to the main program module and running the test function again
"=SUM(A1,A2,A3,A4)" |> convert |> printfn "%A"we get
"(SUM,[A1,A2,A3,A4])"
Flattening while parsing
Despite of the blinding awesomeness of the whole Active Pattern mechanism, it may not be the preferred solution. The main concern is here that the flattening has to be performed every single time the AST is traversed. It is, of course, thinkable to only do this once and write back the result – or rather, to create a new data structure with the unflattened lists being replaced. However, it may be easier to do the whole process while parsing.
Now, Active Patterns work very well when reading existing data structures, but they don’t seem to help all that much when building them up from scratch. What we are looking for is something that can replace the ListExpression constructor, take the same input (this comes from the parser code), and ends up with a flat ListExpression that can be fed into the Function constructor (as per our grammar).
let MakeListExpression l = match l with | h::[ListExpression(t)] -> ListExpression(h::t) | _ -> ListExpression(l)
The MakeListExpression function is expected to be called for the first time (in the innermost nested call, see the nesting for ListExpression above) with a list of Expression expressions. This is the second match case, and we just wrap the list in a ListExpression object.
In subsequent calls we get a list with the first element being an Expression, and the second one of the ListExpression expressions that were created in previous calls to the function. We unwrap the list that we previously had wrapped into a ListExpression, cons the first element, and wrap it all up again for the next step. When the process stops at any given point, we always end up with a ListExpression that contains a flat list with all Expression expressions we have processed.
The parser file so far only uses DU constructors, but there is nothing to keep it from calling just any F# function from our code, as long as all the F# expressions we ask it to produce fit together according to the grammar rules. In this case we can just replace the ListExpression constructor with a call to MakeListExpressions.
PFunctionArgs: | PExpression { $1 } | PExpression COMMA PFunctionArgs { MakeListExpression($1::[$3]) }
For completeness, we also need to put the function somewhere the parser can use it. The ast.fs file comes to mind, and is in fact a good place. Out of the box, this is started with a namespace declaration, though, and cannot contain any values/functions. So other than adding the function to the file, we also need to change
namespace Ast
to
module Ast
Remarks
This also with nested calls, try parsing "=SUM(A1,A2,AVERAGE(B1,B2,B3,B4),A4)" or similar.
Showing the first solution and then discarding it seems like a waste of time. However, there may be reasons in a particular project to stick with it. And it shows a nice example of an Active Pattern making sense of a complex data structure.
There is at least one other solution. I preferred something that could be done in F# code, because it is easier to understand and debug, and leaves the lexer and parser files short and clean.
In the code that produces the string output, I decided to write out Expression lists within square brackets “[exp1, exp2, exp3]”. It would probably be more in line with the rest of the code to use a more Lisp like style, e.g. “(ARGLIST, exp1,exp2,exp3)”. However, the text output is only for visualisation and doesn’t affect the internal data structures anyway, and I like it more this way.
No comments:
Post a Comment