For the mysql-proxy I'm exploring several ways to write a small SQL parser for the internal commands. As I want to use basic SQL to modify the config and the internal operations I need a parser. Some time ago I used lemon which is used in sqlite3, which is easier to use than yacc, but you still have to write the lexer/tokenizer.
This time I looked into ragel which is a state-machine compiler.
Time get rid of code like this:
....
} else if (0 == g_ascii_strncasecmp(s->str + NET_HEADER_SIZE + 1,
C("update proxy_config set value=0 where option=\"proxy.profiling\""))) {
....
Ragel is a bit different than lex/yacc. Ragel combines the two into one and builds a full state-machine for the input stream. Just one file, one state-machine for the parser and lexer.
%%{
machine sql;
SELECT = /SELECT/i;
FROM = /FROM/i;
WHERE = /WHERE/i;
...
## SELECT
action select_new {
stmt_select *s = g_new0(stmt_select, 1);
stmt->type = STMT_SELECT;
stmt->stmt.select = s;
s->result_fields = g_ptr_array_new();
s->from_tables = g_ptr_array_new();
}
...
select_stmt =
SELECT @select_new SP+
fieldname %select_add_field
( SP* ',' SP* fieldname %select_add_field )* SP+
FROM SP+ tablename %select_add_table
( SP+ WHERE SP+ fieldname SP* comp SP* expr )?;
main := (select_stmt | update_...) 0 @{ parse_ok = 1; };
}
- we have some keywords in SQL which are case-insensitve. To easy our life we define some expressions (SELECT, FROM, WHERE, ...)
SP
are whitespaces,SP+
is 1-or-more,SP*
is any-amount (as in regexes).
The only special pieces are the elements starting with %
, @
, <
or
similar: those are actions. You can bind code to state-transitions if a
every flexible way:
identifier = ( alpha [0-9a-zA-Z_]* | '`' (any* -- '`' ) '`');
fieldname = identifier >clear $append;
fieldname is a identifier. When we enter the token we call the clear action to (>action is called when you enter a state) and append every char while we are in this state ($action):
action clear { g_string_truncate(stmt->str, 0); }
action append { g_string_append_c(stmt->str, fc); }
The select_stmt
from about is calling %select_add_field
(%action
is
called when we leave a state) to add the collected string (our
fieldname) to the select.
action select_add_field {
stmt_select *s = stmt->stmt.select;
g_ptr_array_add(s->result_fields, g_strdup(stmt->str->str));
}
I attached my current parser which should match the needs of the admin-console of the proxy:
## compile the .rl-file into XML
$ ragel -o sqlparser.rlc sqlparser.rl
## turn the XML into a c-code
$ rlgen-cd -o sqlparser.c sqlparser.rlc
$ gcc -o sqlparser `pkg-config glib-2.0 --cflags --libs` sqlparser.c
$ ./sqlparser 'SELECT fld1 FROM tbl1'
(add_table(tbl1))
result = 1 (stmt.type = 1)
select.fields = 1
select.field[0] = fld1
select.tables = 1
select.tables[0] = tbl1
With some experimentation it turned out that this parser doesn't have a problem with reserved words. As soon as I would introduce aliases for fields and tables a SELECT could look like:
SELECT from FROM from from
Hmm, is it:
SELECT from AS from FROM from
or
SELECT from FROM from AS from
ragel has the difference-operator which excludes tokens from the list of valid matches:
## keywords
SELECT = /SELECT/i;
UPDATE = /UPDATE/i;
DELETE = /DELETE/i;
INSERT = /INSERT/i;
SET = /SET/i;
WHERE = /WHERE/i;
FROM = /FROM/i;
INTO = /INTO/i;
VALUES = /VALUES/i;
reserved = ( SELECT | UPDATE | DELETE | INSERT | SET |
WHERE | FROM | INTO | VALUES );
identifier = ( alpha [0-9a-zA-Z_]* - reserved |
'`' (any* -- '`' ) '`');
Case close :)
ragel also allows us to build a dot-graph to verify the state-machine visually.
Comments