I have a somewhat strange relation to parser since a while. Like everyone I started with writing little parsers by hand and bounced several times against yacc and flex failing to get around their very own syntax.

Then I discovered lemon and used it in lighttpd for the configuration and HTTP parsing, finally a parser syntax I could read. But it still was a split between lexing and parsing.

Along the the way there was ragel with its wonderful dot-output to visualize its state-engine, very neat and a mixed lexer and parser.

2-3 weeks ago I finally stumbled over LPEG and I'm happily writing parsers now. Like a simple one that can parse complex SELECT queries and indent them properly.

Unit Testing

What helped me a lot writing a parser for SQL was that LPEG is using LUA to build the parser at runtime. On one side this gives you some nice runtime options to switch syntaxes if you need to, but the more important one is (Eric close your ears) Unit-Testing.

A very simple rule and you need pretty much everywhere is detecting alphabetic characters:

local l = require("lpeg")

alpha    = l.R("az", "AZ") -- [a-zA-Z]
---
-- alpha and digits
assert(matched_all(alpha, "a"))  -- matched
assert(matched_any(alpha, "aa")) -- the match stopped after the first char
assert(matched_all(alpha ^ 0, "aaa")) -- ^0 is *, now we match all 'a's
assert(matched_failed(alpha, "0")) -- it shouldn't match numbers

Right from the most basic rule we can write tests to verify that the parser actually works and does what we expect. How much time have you spent before writing tests for flex and yacc before ? Did we write any ?

l.R() builds a range like in the regex in the comment.

With the unit-tests at hand I started to implement more complex rules:

literal_char = alpha + l.S("_") -- /[_a-z]/i
literal_unquoted  = literal_char * ( literal_char + digit )^0  -- /[_a-z][_a-z0-9]*/i
literal_quoted  = l.P("`") * ( 1 - l.P("`") )^0 * l.P("`") -- /`[^`]*`/
literal  = literal_unquoted + literal_quoted

Literals, we have two kinds: quoted and unquoted. The unquoted ones are littled in the characters they can use, the quoted ones are free to use any char or keyword they like.

l.P() matches a string, * is the concat operator, (1 - l.P("`")) is "all but the matched string", and ^0 is "found at least 0 times" like the * in regexes.

Step-by-step with unit-tests at hand you try more and more complex SQL fragments until you finally get to real SQL:

assert(matched_all(sql.select_pat, [[SELECT 1]]))
...
assert(matched_all(sql.select_pat, [[SELECT 1 + 1 FROM `a` AS `b` WHERE a = b]]))
...
assert(matched_all(sql.select_pat, [[SELECT 1 + 1 FROM (SELECT 1 FROM a)AS f JOIN (SELECT 1) AS f2]]))

Real Life Testing

My very own goal was to be able to indent a quite large SQL generated by hibernate to see how we can optimize it. The EXPLAIN for that statement was 62 rows long, after formating the SQL it is 620 lines. It involved sub-queries, user-variables, functions, all kinds of JOINs, GROUP and ORDER BYs and SELECT options. Pretty much everything you can put into a SELECT query.

SELECT
  root.instance_id
  , root.parent_id
  , `exec_time_data_0`.delta_sum
  , `count_data_1`.delta_sum
  , `max_exec_time_data_2`.max_value
  , `rows_data_3`.delta_sum
  , `max_rows_data_4`.max_value
  , `text_data_5`.value
  ,  ( `exec_time_data_0`.delta_sum/`count_data_1`.delta_sum ) AS `avg_exec_time`
  ,  ( `rows_data_3`.delta_sum/`count_data_1`.delta_sum ) AS `avg_rows`
  , `database_data_8`.value
  , `text_hash_data_9`.value
  , `alias_data_10`.value
  ,  ( IFNULL(`alias_data_10`.value, `text_data_5`.value) ) AS `display_value`
  , `bytes_data_12`.delta_sum
  , `max_bytes_data_13`.max_value
FROM
  (
    SELECT
      instance_id
      , parent_id
    FROM
      inventory_instances
    WHERE
      type_id=5
  ) AS root
  JOIN
    (
      SELECT
        instance_id
        ...

Finally I have my SQL indenting in a nice extensible way. With some vim-colours on top I can now cut the Query into pieces and optimize it.

Modules

The whole thing was about 6 hours of work without knowing LPEG before, the final indenter is only a few lines of LUA:

local l   = require("lpeg")
local sql = require("sql")

function tbl2str(t, indent)
    indent = indent or 0
    local o = ""
    if type(t) == "table" then
            for k, v in pairs(t) do
                    o = o .. tbl2str(v, indent + 2)
            end
    else
            o = o .. (" "):rep(indent) .. t:gsub("\n", "") .. "\n"
    end
    return o
end

local p = io.stdin:read("*a")

local matches
matches = assert(lpeg.match(l.Ct(sql.select_pat) * -1, p))

print(tbl2str(matches))

The sql module is just the parser grammar, sql.select_pat is the grammar for the SELECT statement (the * -1 is LPEG for followed by end of file).

Performance

As a last test it nice to see how fast this dynamic parser really is:

local l   = require("lpeg")
local sql = require("sql")

local p = io.stdin:read("*a")

for i = 1, 5000 do
    assert(lpeg.match(l.Ct(sql.select_pat) * -1, p))
end

... and time to measure the execution time:

$ time lua ./sql-bench.lua < ~/query.sql
real    0m6.900s
user    0m6.852s
sys     0m0.028s

... makes 1.38ms to parse the evil query (260 lines, 8260 chars).

For a simple query like SELECT 1 it goes down to 0.5ms. This is not really high-performance, but good enough for a indenter.

Summary

Even we this is only used for indenting, the underlying parser can also be used to create a nice syntax-tree and do transformations on it. It is all in there.

Sure, the parser-grammar needs some more rules to add proper escaping of strings and handling of grammar that is ambigious in the 'real' SQL parser of MySQL, but this is good enough as a starting point and was a nice exercise in writing parsers and unit-testing.

Lua: Using LPEG for SQL Indenting