I have a somewhat strange relation to parser since a while. Like everyone I started with writing little parsers by hand and bounced several times against yacc and flex failing to get around their very own syntax.
Along the the way there was ragel with its wonderful dot-output to visualize its state-engine, very neat and a mixed lexer and parser.
2-3 weeks ago I finally stumbled over LPEG and I’m happily writing parsers now. Like a simple one that can parse complex SELECT queries and indent them properly.
What helped me a lot writing a parser for SQL was that LPEG is using LUA to build the parser at runtime. On one side this gives you some nice runtime options to switch syntaxes if you need to, but the more important one is (Eric close your ears) Unit-Testing.
A very simple rule and you need pretty much everywhere is detecting alphabetic characters:
local l = require("lpeg") alpha = l.R("az", "AZ") -- [a-zA-Z] --- -- alpha and digits assert(matched_all(alpha, "a")) -- matched assert(matched_any(alpha, "aa")) -- the match stopped after the first char assert(matched_all(alpha ^ 0, "aaa")) -- ^0 is *, now we match all 'a's assert(matched_failed(alpha, "0")) -- it shouldn't match numbers
Right from the most basic rule we can write tests to verify that the parser actually works and does what we expect. How much time have you spent before writing tests for flex and yacc before ? Did we write any ?
l.R() builds a range like in the regex in the comment.
With the unit-tests at hand I started to implement more complex rules:
literal_char = alpha + l.S("_") -- /[_a-z]/i literal_unquoted = literal_char * ( literal_char + digit )^0 -- /[_a-z][_a-z0-9]*/i literal_quoted = l.P("`") * ( 1 - l.P("`") )^0 * l.P("`") -- /`[^`]*`/ literal = literal_unquoted + literal_quoted
Literals, we have two kinds: quoted and unquoted. The unquoted ones are littled in the characters they can use, the quoted ones are free to use any char or keyword they like.
l.P() matches a string,
* is the concat operator,
(1 - l.P("`")) is “all but the matched string”, and
^0 is “found at least 0 times” like the
* in regexes.
Step-by-step with unit-tests at hand you try more and more complex SQL fragments until you finally get to real SQL:
assert(matched_all(sql.select_pat, [[SELECT 1]])) ... assert(matched_all(sql.select_pat, [[SELECT 1 + 1 FROM `a` AS `b` WHERE a = b]])) ... assert(matched_all(sql.select_pat, [[SELECT 1 + 1 FROM (SELECT 1 FROM a)AS f JOIN (SELECT 1) AS f2]]))
My very own goal was to be able to indent a quite large SQL generated by hibernate to see how we can optimize it. The
EXPLAIN for that statement was 62 rows long, after formating the SQL it is 620 lines. It involved sub-queries, user-variables, functions, all kinds of
ORDER BYs and
SELECT options. Pretty much everything you can put into a
SELECT root.instance_id , root.parent_id , `exec_time_data_0`.delta_sum , `count_data_1`.delta_sum , `max_exec_time_data_2`.max_value , `rows_data_3`.delta_sum , `max_rows_data_4`.max_value , `text_data_5`.value , ( `exec_time_data_0`.delta_sum/`count_data_1`.delta_sum ) AS `avg_exec_time` , ( `rows_data_3`.delta_sum/`count_data_1`.delta_sum ) AS `avg_rows` , `database_data_8`.value , `text_hash_data_9`.value , `alias_data_10`.value , ( IFNULL(`alias_data_10`.value, `text_data_5`.value) ) AS `display_value` , `bytes_data_12`.delta_sum , `max_bytes_data_13`.max_value FROM ( SELECT instance_id , parent_id FROM inventory_instances WHERE type_id=5 ) AS root JOIN ( SELECT instance_id ...
Finally I have my SQL indenting in a nice extensible way. With some vim-colours on top I can now cut the Query into pieces and optimize it.
The whole thing was about 6 hours of work without knowing LPEG before, the final indenter is only a few lines of LUA:
local l = require("lpeg") local sql = require("sql") function tbl2str(t, indent) indent = indent or 0 local o = "" if type(t) == "table" then for k, v in pairs(t) do o = o .. tbl2str(v, indent + 2) end else o = o .. (" "):rep(indent) .. t:gsub("\n", "") .. "\n" end return o end local p = io.stdin:read("*a") local matches matches = assert(lpeg.match(l.Ct(sql.select_pat) * -1, p)) print(tbl2str(matches))
sql module is just the parser grammar,
sql.select_pat is the grammar for the SELECT statement (the
* -1 is LPEG for
followed by end of file).
As a last test it nice to see how fast this dynamic parser really is:
local l = require("lpeg") local sql = require("sql") local p = io.stdin:read("*a") for i = 1, 5000 do assert(lpeg.match(l.Ct(sql.select_pat) * -1, p)) end
time to measure the execution time:
$ time lua ./sql-bench.lua < ~/query.sql real 0m6.900s user 0m6.852s sys 0m0.028s
… makes 1.38ms to parse the evil query (260 lines, 8260 chars).
For a simple query like
SELECT 1 it goes down to 0.5ms. This is not really high-performance, but good enough for a indenter.
Even we this is only used for indenting, the underlying parser can also be used to create a nice syntax-tree and do transformations on it. It is all in there.
Sure, the parser-grammar needs some more rules to add proper escaping of strings and handling of grammar that is ambigious in the ‘real’ SQL parser of MySQL, but this is good enough as a starting point and was a nice exercise in writing parsers and unit-testing.