Kris brought up a simple question:

Why do I have to set the slow-query time by hand? Why can't the server figure out the normal query time and tell me when something is unusually slow?

In earlier articles I already showed that the proxy can log query times in microseconds, and we already implemented a Query Histogram with average and max query time. But when is a query unusually slow?

Almost all (actually, 99.7%) of the values lie within 3 standard deviations of the mean (or between the mean minus 3 times the standard deviation and the mean plus 3 times the standard deviation). Statisticians use the following notation to represent this: μ ± 3σ.

see http://en.wikipedia.org/wiki/68-95-99.7_rule

Hmm, so everything that is slower than μ + 3σ is unusual. We can use this to auto-tune the threshold for the slow-query log.

First we want to find the average query time over all queries and their standard deviation:

# 2007-09-14 16:53:56 [154] user: merlin, db: merlin
  Query:                     "commit"
  Norm_Query:                "`commit` "
  query_time:                104.60 ms
  global(avg_query_time):    744.40 us
  global(stddev_query_time): 10.93 ms
  global(count):             29180
  query(avg_query_time):     1311.59 us
  query(stddev_query_time):  22.27 ms
  query(count):              3164
  (qtime > global-threshold)

Oops, a COMMIT takes a while. This one took 104 ms against an average of 1.3 ms, and as you can see the stddev is pretty large. The query was logged because its query time exceeded the global threshold of μ + 3σ (744.40 µs + 3 × 10.93 ms ≈ 33.5 ms):

                    -- only start reporting after enough samples, and flag the
                    -- query if it is more than 3 stddevs above the global mean
                    if bl.avg.count > config.min_queries and
                       inj.query_time > avg_query_time + 3 * stddev_query_time then
                            o = o .. "  (qtime > global-threshold)\n"
                            log_query = true
                    end
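By the way, the average and stddev don't have to be recomputed over all logged query times on every request; they can be tracked incrementally. Here is a minimal sketch of such a running tracker in Lua (Welford's algorithm); the stats table and the function names are made up for illustration, they are not the internals of analyze-query.lua:

    -- illustrative running mean/stddev tracker (Welford's algorithm),
    -- not the actual internals of analyze-query.lua
    local stats = { count = 0, mean = 0, m2 = 0 }

    -- feed one query time (in microseconds) into the tracker
    function stats_update(s, query_time)
        s.count = s.count + 1
        local delta = query_time - s.mean
        s.mean = s.mean + delta / s.count
        s.m2   = s.m2 + delta * (query_time - s.mean)
    end

    -- stddev over all query times seen so far
    function stats_stddev(s)
        if s.count < 2 then return 0 end
        return math.sqrt(s.m2 / (s.count - 1))
    end

    -- "unusually slow" means more than 3 stddevs above the mean
    function stats_is_slow(s, query_time)
        return query_time > s.mean + 3 * stats_stddev(s)
    end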

Auto PROCESSLIST

A common cause of a slow query is not the query itself, but concurrency: other queries running in parallel against the same data. Lock contention, pressure on the caches, being IO-bound, ...

To track down those problems more easily, you can tell the script to append a SHOW FULL PROCESSLIST whenever it finds a slow query and put that into the logfile too:

mysql> SET GLOBAL analyze_query.auto_processlist = 1;

# 2007-09-14 16:15:35 [119] user: merlin, db: merlin
  Query:                     "SELECT 1"
  Norm_Query:                "SELECT ? "
  query_time:                15.57 ms
  global(avg_query_time):    704.78 us
  global(stddev_query_time): 2456.16 us
  global(count):             829
  query(avg_query_time):     331.45 us
  query(stddev_query_time):  1230.25 us
  query(count):              165
  (global)
  (query)
SHOW FULL PROCESSLIST
  [0]{ 35, "merlin", "localhost:28907", NULL, "Sleep", 35, "", NULL }
  [1]{ 36, "merlin", "localhost", NULL, "Sleep", 2280, "", NULL }
  [2]{ 118, "merlin", "localhost:19290", "merlin", "Sleep", 2, "", NULL }
  [3]{ 119, "merlin", "localhost:19291", "merlin", "Query", 0, NULL, "SHOW FULL PROCESSLIST" }
  [4]{ 120, "merlin", "localhost:19292", "merlin", "Sleep", 2, "", NULL }
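Under the hood this works by injecting an extra query into the connection after the slow one was logged. A rough sketch of the pattern with the proxy's Lua hooks; the id constants and the is_unusually_slow()/log_resultset() helpers are invented for the example:

    -- ids to tell our injected queries apart from the client's query
    local ID_CLIENT      = 1
    local ID_PROCESSLIST = 2

    function read_query(packet)
        if packet:byte() == proxy.COM_QUERY then
            -- forward the client's query, but route the result through
            -- read_query_result() so we can look at its query time
            proxy.queries:append(ID_CLIENT, packet)
            return proxy.PROXY_SEND_QUERY
        end
    end

    function read_query_result(inj)
        if inj.id == ID_CLIENT then
            if is_unusually_slow(inj.query_time) then
                -- append a SHOW FULL PROCESSLIST behind the slow query
                proxy.queries:append(ID_PROCESSLIST,
                    string.char(proxy.COM_QUERY) .. "SHOW FULL PROCESSLIST")
            end
        elseif inj.id == ID_PROCESSLIST then
            -- log the rows, but hide the injected resultset from the client
            log_resultset(inj.resultset)
            return proxy.PROXY_IGNORE_RESULT
        end
    end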

Auto EXPLAIN

Or how about an EXPLAIN against the SELECT query, just in case?

mysql> SET GLOBAL analyze_query.auto_explain = 1;

# 2007-09-14 16:15:35 [119] user: merlin, db: merlin
  Query:                     "SELECT 1"
  Norm_Query:                "SELECT ? "
  query_time:                15.57 ms
  global(avg_query_time):    704.78 us
  global(stddev_query_time): 2456.16 us
  global(count):             829
  query(avg_query_time):     331.45 us
  query(stddev_query_time):  1230.25 us
  query(count):              165
EXPLAIN <query>
  [0]{ 1, "SIMPLE", NULL, NULL, NULL, NULL, NULL, NULL, NULL, "No tables used" }
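The mechanics are the same as for the processlist; the only extra bit is building the EXPLAIN statement from the original query. A sketch of that part, reusing the injection pattern from above (in the proxy's Lua API, inj.query carries the command byte in its first character, hence the sub(2)); ID_EXPLAIN is again an invented constant:

    -- inside read_query_result(), after the slow query was logged:
    local query = inj.query:sub(2)  -- strip the COM_QUERY command byte
    -- only SELECT statements can be EXPLAINed
    if query:upper():match("^%s*SELECT") then
        proxy.queries:append(ID_EXPLAIN,
            string.char(proxy.COM_QUERY) .. "EXPLAIN " .. query)
    end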

Query Histogram

As this is based on the Query Histogram we talked about earlier, you can also track the histogram yourself while the data is passing through. All the times are in microseconds, as always:

root@127.0.0.1:4040 [merlin]> select * from histogram.queries;
+---------------------------------------------+----------+------------+---------------+
| query                                       | COUNT(*) | AVG(qtime) | STDDEV(qtime) |
+---------------------------------------------+----------+------------+---------------+
| INSERT INTO `monitor_history_times` VALU... |       12 |     985.67 |       1599.54 |
| SELECT ( ? > ? )                            |       12 |     186.00 |         14.32 |
| SELECT COUNT( `schedule_id` ) FROM `moni... |       68 |     574.85 |        201.27 |
| SELECT * FROM `dc_schedule` WHERE `serve... |       22 |    1349.23 |         47.98 |
| SELECT ( ( ? > ? ) && ( ( ( ? / ? ) * ? ... |       60 |     251.02 |         63.76 |
| SELECT SQL_CALC_FOUND_ROWS `monitor_hist... |       69 |   22201.77 |     108515.71 |
| `commit`                                    |     3560 |    1227.78 |      21020.70 |
...

Looks like the SELECT SQL_CALC_FOUND_ROWS takes a while (22 ms on average). On the other hand, we have a bunch of queries here which run at network-latency speed (186 µs for a SELECT 2 > 1).

But what about the COMMIT statements? (Yes, I'm running with innodb_flush_log_at_trx_commit = 2.)
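By the way, histogram.queries is not a real table: the proxy intercepts that SELECT and answers it from its in-memory histogram without ever asking the backend. A condensed sketch of how such a self-answered resultset looks in the Lua API; the layout of the histogram table is invented for the example:

    function read_query(packet)
        if packet:byte() ~= proxy.COM_QUERY then return end

        -- intercept the pseudo-table and answer from our own data
        if packet:sub(2):lower():match("histogram%.queries") then
            local rows = {}
            for query, h in pairs(histogram) do  -- 'histogram' is illustrative
                rows[#rows + 1] = { query, h.count, h.avg, h.stddev }
            end
            proxy.response = {
                type = proxy.MYSQLD_PACKET_OK,
                resultset = {
                    fields = {
                        { type = proxy.MYSQL_TYPE_STRING, name = "query" },
                        { type = proxy.MYSQL_TYPE_LONG,   name = "COUNT(*)" },
                        { type = proxy.MYSQL_TYPE_DOUBLE, name = "AVG(qtime)" },
                        { type = proxy.MYSQL_TYPE_DOUBLE, name = "STDDEV(qtime)" },
                    },
                    rows = rows,
                },
            }
            -- send our own resultset, don't forward anything to the server
            return proxy.PROXY_SEND_RESULT
        end
    end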

Script

The script we are talking about here is already in SVN, and if you are running the 0.6.0 version you can update your packaged script with a simple:

$ wget http://svn.mysql.com/svnpublic/mysql-proxy/trunk/lib/analyze-query.lua

It uses the auto-config module to expose all the config options for configuration at runtime:

    proxy.global.config.analyze_query = {
            analyze_queries  = true,   -- track all queries
            auto_explain     = false,  -- execute an EXPLAIN on SELECT queries after logging
            auto_processlist = false,  -- execute a SHOW FULL PROCESSLIST after logging
            min_queries      = 1000,   -- start logging after evaluating 1000 queries
            min_query_time   = 1000,   -- don't log queries that take less than 1 ms
            query_cutoff     = 160,    -- only show the first 160 chars of the query
            log_slower_queries = false -- log the query if it is slower than queries of the same kind
    }
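The auto-config module is what maps a SET GLOBAL analyze_query.<option> = <value> onto this table. In spirit it does something like the following simplified illustration (not the module's actual code):

    -- simplified illustration of the SET GLOBAL handling, not the
    -- actual auto-config code
    function apply_set_global(query)
        local key, val = query:match(
            "SET%s+GLOBAL%s+analyze_query%.(%w+)%s*=%s*(%S+)")
        if not key then return false end

        local cfg = proxy.global.config.analyze_query
        if cfg[key] == nil then return false end  -- unknown option

        if type(cfg[key]) == "boolean" then
            cfg[key] = (val == "1" or val:lower() == "true")
        else
            cfg[key] = tonumber(val)
        end
        return true  -- handled; the script answers with an OK packet
    end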

You can use:

SET GLOBAL analyze_query.analyze_queries = 0

to disable the logging, bypass all the analysis code, and reduce the latency. As soon as you need it again, just set it back to 1.

  • min_queries reduces the statistical error: reporting of slow queries only starts after 1000 queries have been seen.
  • query_cutoff only matters for the internal SELECT: it limits the number of characters printed in the query column.
  • log_slower_queries, if enabled, adds extra logging when a query is slower than other queries of its kind.
