For a little while on the train this afternoon I was convinced I had coined a new expression. Back at my computer, though, the Internet informed me that people have been “lispifying” things since before I was born, and I just hadn’t been in on the joke. Nevertheless, it’s a good word, and I will proceed to do just that, lispify something.
There was a question on Stack Overflow a couple of days ago asking how one could, in F#, take the content of an Excel spreadsheet cell and interpret it to find out whether it is a simple value or a formula, and in the latter case, return the operation and arguments. The poster of the question was after something that would return either the value or a tuple with the details of the formula; for the exact request, see here.
This struck me as a good opportunity to get my hands on the F# lexer and parser. As an example it seems just simple enough not to get completely lost in all sorts of arcane specialist tweaks that only hardcore compiler builders know about, and yet it offers some real insight into how the whole process works.
First things first
We need some tools. As of the current version of F#, the parser/lexer binaries are not in the main distribution but in the F# PowerPack, so this needs to be downloaded and installed. I have used Visual Studio 2010 for this, but 2008 might work just as well. And in order to get all the bits and pieces together quickly, we will use the F# Parsed Language Starter template for Visual Studio.
There is some documentation out there on how to use FsLex and FsYacc. Robert Pickering gives a good introduction in Foundations of F#; a bit more technical is the chapter in Expert F# (Don Syme, Adam Granicz, Antonio Cisternino), which also touches on potential parsing issues, conflicts, operator precedence, and more. I am guessing that if you are reading this, you know both books anyway. There is also the Wikibook, which I like very much. Read through all of this before continuing.
Getting started
Open a new project in Visual Studio. Find the F# project templates in the list and select the Parsed Language Starter.
This creates a project to parse instructions for a little calculator application. We won’t be needing the calculator itself, but it is a rather nice way to get all the necessary files created and to make sure that the lexer and parser are called during the build. Currently neither tool is officially integrated into Visual Studio, which means that the only way to use them is either a separate command line window or a pre-build step in the project settings; both are a bit clumsy and don’t do much for keeping track of build dependencies either.
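For reference, and with the caveat that the exact paths and switches depend on your PowerPack installation, such a manual invocation (or pre-build step) would look something like this: fslex producing lexer.fs from lexer.fsl, and fsyacc producing parser.fs from parser.fsy, with the module name matching what the rest of the code expects.

fslex.exe --unicode lexer.fsl
fsyacc.exe --module Parser parser.fsy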
The files in the new project are
- lexer.fsl, which has the syntax description that is needed to correctly tokenise the input strings.
- parser.fsy, which describes our grammar and is used to make sense of the tokens that the lexer creates.
- ast.fs, with definitions of a number of discriminated unions that make up the abstract syntax tree, the internal representation of the lexed and parsed input, i.e. the data structures that our program works with.
- program.fs, from where the lexer and parser are called and fed with input, and where the rest of our program logic goes.
Lispification
In terms of a brief specification, here is what the program is supposed to do:
1. Take a string in the form of a spreadsheet cell. I am using OpenOffice Calc where I am sitting now, and Excel on other occasions, and the basic format is pretty much the same. You have either
- a number, or
- a formula, starting with ‘=’
In the case of a formula, you can have numbers again, or references to other cells, in the form “A1”, “B2”, etc. and arithmetic combinations thereof, using the normal arithmetic operations. There can also be calls to built-in functions; the syntax for these is the function name, followed by a pair of parentheses with argument values inside. Valid cell content looks like
1002
=SUM(B2:B12)
=A1+AVERAGE(B2:B4)
2. Convert this into an internal representation that our program can work with. I should mention at this point that we can’t just create tuples; depending on the formula that is parsed, the result can have arbitrary length and depth, and tuples are not good for this. What we will get instead is a tree of discriminated unions, which are easy enough to traverse and use in F# anyway (see the sketch after this list).
3. Once we have an internal representation, also create output strings, merely for visual pleasure. Output strings will look like comma-separated lists enclosed in parentheses. The first element of such a list is the operation to perform (if any), and the rest are the arguments it is performed on. Lists can be nested if the formulas are more complicated. The original cells above would be output as
(1002)
(SUM,(RANGE,B2,B12))
(+,A1,(AVERAGE,(RANGE,B2,B4)))
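To make point 2 concrete: with the discriminated unions we will define further down in ast.fs, the third sample cell would come out as a nested value roughly like this — a shape that a fixed-size tuple could not capture:

FormulaCell(
    InfixOperation(
        Plus(
            CellRef(SingleCellRef "A1"),
            Function("AVERAGE", CellRef(RangeCellRef("B2", "B4"))))))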
Trick question
Where does the term “lispify” come from?
The lexer file
The lexer.fsl file in the new F# project contains a header, enclosed in curly brackets {..}, a number of regular expression definitions and a single tokenisation rule. The header bit is used to pull in necessary definitions from other modules and we leave it intact. The regular expressions are used to match against bits of the input strings, and the rule instructs the lexer to emit tokens for every match that is found.
The tokens themselves are defined in the parser file, so the two files really depend on each other. Some of the tokens have types associated with them and are assigned a value whenever a corresponding match is found.
Where for instance the parser file contains a definition
%token <System.Double> FLOAT
any lexer rule that creates this token needs to assign it a value of type double (usually derived from the bit of the input string that the rule has just matched):
| ['-']?digit+('.'digit+)?(['e''E']digit+)? { FLOAT (Double.Parse(lexeme lexbuf)) }
Many of the regular expressions and most of the rule cases are also useful for our program, but just to avoid confusion, let’s remove all of them (leave the {...} header intact, though!) and replace them with the following bits:
(* Regular Expressions *)
let digit = ['0'-'9']
let whitespace = [' ' '\t']
let newline = ('\n' | '\r' '\n')
let cellref = ['A'-'Z']+digit+   // matches a cell reference, like "A2"
let function = ['A'-'Z']+        // matches a function name

and
(* Tokenisation Rule *)
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
| "+" { PLUS }
| "-" { MINUS }
| "*" { ASTER }
| "/" { SLASH }
| "(" { LPAREN }
| ")" { RPAREN }
| ['-']?digit+ { INT32 (Int32.Parse(lexeme lexbuf)) }
| ['-']?digit+('.'digit+)?(['e''E']digit+)? { FLOAT (Double.Parse(lexeme lexbuf)) }
| eof { EOF }
| cellref { CELLREF(lexeme lexbuf) }   // reference to another cell, like A3, G6, etc.
| "=" { FORMULA }
| function { FUNCTION(lexeme lexbuf) } // a function in the spreadsheet cell
| ":" { RANGESEP }                     // separates two cell references to make a range
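To get a feel for what the rule does: fed the sample cell "=SUM(B2:B12)", tokenize should emit roughly this stream of tokens (the quoted strings are the values attached to the typed tokens):

FORMULA
FUNCTION "SUM"
LPAREN
CELLREF "B2"
RANGESEP
CELLREF "B12"
RPAREN
EOF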
The parser file
The parser.fsy file is where everything comes together. The parser needs to know about the tokens that the lexer is going to produce, and in fact all those tokens are defined in this file. It also defines the grammar, i.e. in which combinations and in what order our input makes sense, and what sense exactly.
Depending on the tokens passed on from the lexer and how they are combined, the parser emits F# code into our program. The only code we will produce here are discriminated union constructors, but in principle we could do other things. All we want to do now, however, is set up a data structure, so that’s good enough.
The parser file can get a bit confusing, because it mixes all these different things, and you are working with a lot of identifiers that come from three different namespaces:
- token identifiers, which are defined here but used first by the lexer,
- F# identifiers from the code templates, in our case the discriminated union constructor identifiers, and
- the identifiers that make up the parser’s grammar.
It doesn’t really help that they normally represent related concepts and have very similar names. In the sample code, I have tried to make the distinction more obvious: token identifiers are all uppercase (FORMULA), names used in the grammar start with an uppercase ‘P’ (PExpression), and the rest are symbols from the F# side.
In the same way as with the lexer, we take no prisoners here: get rid of all the token and grammar bits and replace them with our own (leave the %{..%} header, though):
%start start

%token <System.Int32> INT32
%token <System.Double> FLOAT
%token PLUS MINUS ASTER SLASH
%token LPAREN RPAREN
%token EOF
%token <System.String> CELLREF
%token FORMULA
%token <System.String> FUNCTION
%token RANGESEP

%left PLUS MINUS
%left ASTER SLASH

%type < Ast.Cell > start

%%

and
start: PCell { $1 }

A few remarks: the %start declaration marks where the root of our data structure will be; it is named ‘start’ (appropriately), and the associated type from our F# code is Ast.Cell. The grammar rules themselves look like this:
PCell:
| PConstant { ConstantCell $1 }
| FORMULA PExpression { FormulaCell $2 }
PExpression:
| PConstant { Constant $1 }
| PExpression ASTER PExpression { InfixOperation(Times($1, $3)) }
| PExpression SLASH PExpression { InfixOperation(Divide($1, $3)) }
| PExpression PLUS PExpression { InfixOperation(Plus($1, $3)) }
| PExpression MINUS PExpression { InfixOperation(Minus($1, $3)) }
| FUNCTION LPAREN PExpression RPAREN { Function($1,$3) }
| PCellref { CellRef($1) }
PConstant:
| INT32 { Integer($1) }
| FLOAT { Float($1) }
PCellref:
| CELLREF { SingleCellRef $1 }
| CELLREF RANGESEP CELLREF { RangeCellRef($1, $3) }
Note once more the type declaration from the header:

%type < Ast.Cell > start

This means that the parser will create F# code that builds an Ast.Cell from scratch, with elements of all the other data types hanging off it; in a bit we will be able to see that this is indeed the case. From the parser’s point of view, the top level of the evaluation is ‘PCell’ (which is everything), which consists of the elements ‘PConstant’ or ‘FORMULA PExpression’, etc. Where token names are used, they match directly against the tokens from the lexer; for other identifiers there is another case further down the hierarchy (‘PConstant’, for example, is either INT32 or FLOAT), but eventually every path from ‘PCell’ must end in one or more lexer tokens.
For each element of the grammar that could be matched, the F# expression in curly brackets is emitted into the code. We can see that these all construct a hierarchy of nested discriminated unions.
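This is also where the %left declarations from the token section earn their keep: ASTER and SLASH are declared after PLUS and MINUS and therefore bind more tightly, so a cell like "=A1+B4*C6" (one of the test strings we will use at the end) should come out with the multiplication nested inside the addition, roughly as:

FormulaCell(
    InfixOperation(
        Plus(
            CellRef(SingleCellRef "A1"),
            InfixOperation(
                Times(CellRef(SingleCellRef "B4"), CellRef(SingleCellRef "C6"))))))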
The abstract syntax tree
The AST lives in ast.fs. “Abstract syntax tree” is a rather theoretical term; what we have here, in fact, are type definitions for all the discriminated unions that will eventually make up our data structure. They need to fit into each other in a way that doesn’t conflict with what we told the parser about our grammar, and one of them needs to be able to stand alone, but other than that, this bit is pretty straightforward.
Again, we remove all the type definitions from the existing file, and replace with our own:
type Cell =
    | FormulaCell of Expression
    | ConstantCell of Value
and Expression =
    | CellRef of Reference
    | Function of String * Expression
    | Constant of Value
    | InfixOperation of BinaryOperator
and BinaryOperator =
    | Plus of Expression * Expression
    | Minus of Expression * Expression
    | Times of Expression * Expression
    | Divide of Expression * Expression
and Value =
    | Float of Double
    | Integer of Int32
and Reference =
    | SingleCellRef of String
    | RangeCellRef of String * String
Done (well, almost)
There is some code in the program.fs file which may or may not work with our new lexer and parser; I have simply removed it and kept only the most relevant lines:
let cell = "=SUM(B2:B12)"
let lexbuff = LexBuffer<char>.FromString(cell)
let parsed = Parser.start Lexer.tokenize lexbuff

The project should now build OK (try!), and if you run it in the debugger up to this point, you should be able to see the whole data structure in the variable ‘parsed’. For programmatic evaluation of the Excel formula, this should be sufficient. The type of the result is the type of the topmost discriminated union that we have defined, and it is not very difficult to traverse the whole tree using pattern matches.
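To illustrate that last point (a minimal sketch of my own, not part of the project): an evaluator over this tree might look like the following, with open Ast at the top, a caller-supplied lookup function that resolves a single cell reference to a number, and no attempt at handling ranges or built-in functions.

// Minimal, hypothetical evaluator over the Ast types; 'lookup' resolves
// a cell reference such as "A1" to its numeric value.
let rec evalCell (lookup: string -> float) cell =
    match cell with
    | ConstantCell v -> evalValue v
    | FormulaCell e -> evalExpr lookup e
and evalExpr lookup expr =
    match expr with
    | Constant v -> evalValue v
    | CellRef (SingleCellRef r) -> lookup r
    | CellRef (RangeCellRef _) -> failwith "ranges only make sense as function arguments"
    | Function (name, _) -> failwithf "built-in function %s not implemented" name
    | InfixOperation op -> evalOp lookup op
and evalOp lookup op =
    match op with
    | Plus (a, b)   -> evalExpr lookup a + evalExpr lookup b
    | Minus (a, b)  -> evalExpr lookup a - evalExpr lookup b
    | Times (a, b)  -> evalExpr lookup a * evalExpr lookup b
    | Divide (a, b) -> evalExpr lookup a / evalExpr lookup b
and evalValue v =
    match v with
    | Float f -> f
    | Integer i -> float i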
Done (really)
Here is the code that traverses the whole tree and creates the promised text output. I have added a file astwalker.fs to the project and put it in there. If you do the same, please remember that the source files (program.fs and astwalker.fs) now need module names (because there is more than one of them), and that you need to reorder the files in the project so that program.fs remains the last file.
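For reference, the top of astwalker.fs then starts with something like this (the names are up to you; Ast is the module that the starter template’s ast.fs declares):

module AstWalker

open Ast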
let rec walkCell x =
    match x with
    | ConstantCell(v) -> walkValue v
    | FormulaCell(f) -> walkExpression f
and walkExpression x =
    match x with
    | InfixOperation(b) -> walkBinaryOperator b
    | Constant(c) -> walkValue c
    | Function(f, arg) -> walkFunction f arg
    | CellRef(r) -> walkRef r
and walkBinaryOperator x =
    let extractOp = function
        | Plus(a, b) -> (a, b, "+")
        | Minus(a, b) -> (a, b, "-")
        | Times(a, b) -> (a, b, "*")
        | Divide(a, b) -> (a, b, "/")
    let (var1, var2, op) = extractOp x
    "(" + op + "," + walkExpression var1 + "," + walkExpression var2 + ")"
and walkValue x =
    match x with
    | Float(f) -> f.ToString()
    | Integer(i) -> i.ToString()
and walkFunction function_name arg =
    "(" + function_name + "," + walkExpression arg + ")"
and walkRef x =
    match x with
    | SingleCellRef(r) -> r
    | RangeCellRef(r1, r2) -> "(RANGE," + r1 + "," + r2 + ")"

If we go back to program.fs, we can create a little function that calls all three components:
let convert cell =
    let lexbuff = LexBuffer<char>.FromString(cell)
    let parsed = Parser.start Lexer.tokenize lexbuff
    AstWalker.walkCell parsed

and then call it for a bunch of test input strings:
let teststrings = [ "1002"; "=SUM(B2:B12)"; "=A1+AVERAGE(B2:B4)"; "=B3+G7"; "=MIN(MEAN(A3:A20))"; "=A1+B4*C6" ]

teststrings |> List.iter (fun s -> (convert s |> printfn "%A\t%A" s))

The output should look like
"1002"                 "1002"
"=SUM(B2:B12)"         "(SUM,(RANGE,B2,B12))"
"=A1+AVERAGE(B2:B4)"   "(+,A1,(AVERAGE,(RANGE,B2,B4)))"
"=B3+G7"               "(+,B3,G7)"
"=MIN(MEAN(A3:A20))"   "(MIN,(MEAN,(RANGE,A3,A20)))"
"=A1+B4*C6"            "(+,A1,(*,B4,C6))"
Limitations: there is no proper error checking; any input that the program doesn’t like will make it crash. Please modify and extend at your leisure.