For the mysql-proxy I'm exploring several ways to write a small SQL parser for the internal commands. As I want to use basic SQL to modify the config and the internal operations I need a parser. Some time ago I used lemon which is used in sqlite3, which is easier to use than yacc, but you still have to write the lexer/tokenizer.

This time I looked into ragel which is a state-machine compiler.

Time get rid of code like this:

....
} else if (0 == g_ascii_strncasecmp(s->str + NET_HEADER_SIZE + 1, 
     C("update proxy_config set value=0 where option=\"proxy.profiling\""))) {
....

Ragel is a bit different than lex/yacc. Ragel combines the two into one and builds a full state-machine for the input stream. Just one file, one state-machine for the parser and lexer.

%%{
  machine sql;

  SELECT = /SELECT/i;
  FROM   = /FROM/i;
  WHERE  = /WHERE/i;

  ...

  ## SELECT
  action select_new {
    stmt_select *s = g_new0(stmt_select, 1);
    stmt->type = STMT_SELECT;
    stmt->stmt.select = s;

    s->result_fields = g_ptr_array_new();
    s->from_tables = g_ptr_array_new();
  }

  ...

  select_stmt =
    SELECT @select_new SP+
      fieldname %select_add_field
      ( SP* ',' SP* fieldname %select_add_field )* SP+
    FROM SP+ tablename %select_add_table
      ( SP+ WHERE SP+ fieldname SP* comp SP* expr )?;

  main := (select_stmt | update_...) 0 @{ parse_ok = 1; };
}
  • we have some keywords in SQL which are case-insensitve. To easy our life we define some expressions (SELECT, FROM, WHERE, ...)
  • SP are whitespaces, SP+ is 1-or-more, SP* is any-amount (as in regexes).

The only special pieces are the elements starting with %, @, < or similar: those are actions. You can bind code to state-transitions if a every flexible way:

identifier = ( alpha [0-9a-zA-Z_]* | '`' (any* -- '`' ) '`');

fieldname = identifier >clear $append;

fieldname is a identifier. When we enter the token we call the clear action to (>action is called when you enter a state) and append every char while we are in this state ($action):

action clear { g_string_truncate(stmt->str, 0); }
action append { g_string_append_c(stmt->str, fc); }

The select_stmt from about is calling %select_add_field (%action is called when we leave a state) to add the collected string (our fieldname) to the select.

action select_add_field {
  stmt_select *s = stmt->stmt.select;
  g_ptr_array_add(s->result_fields, g_strdup(stmt->str->str));
}

I attached my current parser which should match the needs of the admin-console of the proxy:

## compile the .rl-file into XML
$ ragel -o sqlparser.rlc sqlparser.rl
## turn the XML into a c-code
$ rlgen-cd -o sqlparser.c sqlparser.rlc
$ gcc -o sqlparser `pkg-config glib-2.0 --cflags --libs` sqlparser.c
$  ./sqlparser 'SELECT fld1 FROM tbl1'
(add_table(tbl1))
result = 1 (stmt.type = 1)
select.fields = 1
select.field[0] = fld1
select.tables = 1
select.tables[0] = tbl1

With some experimentation it turned out that this parser doesn't have a problem with reserved words. As soon as I would introduce aliases for fields and tables a SELECT could look like:

SELECT from FROM from from

Hmm, is it:

SELECT from AS from FROM from

or

SELECT from FROM from AS from

ragel has the difference-operator which excludes tokens from the list of valid matches:

## keywords
SELECT = /SELECT/i;
UPDATE = /UPDATE/i;
DELETE = /DELETE/i;
INSERT = /INSERT/i;
SET    = /SET/i;
WHERE  = /WHERE/i;
FROM   = /FROM/i;
INTO   = /INTO/i;
VALUES = /VALUES/i;

reserved = ( SELECT | UPDATE | DELETE | INSERT | SET |
    WHERE | FROM | INTO | VALUES );

identifier = ( alpha [0-9a-zA-Z_]* - reserved |
    '`' (any* -- '`' ) '`');

Case close :)

ragel also allows us to build a dot-graph to verify the state-machine visually.


Comments

Enable javascript to load comments.