Kris brought up a simple question:
Why do I have to set the slow-query-time by hand? Why can't the server figure out the normal query time and tell me when something is unusually slow?
In earlier articles I already showed that the proxy can log the query time in microseconds, and that we implemented a Query Histogram with average and max query time. Unusually slow ...
Almost all (actually, 99.7%) of the values lie within 3 standard deviations of the mean (or between the mean minus 3 times the standard deviation and the mean plus 3 times the standard deviation). Statisticians use the following notation to represent this: μ ± 3σ.
see http://en.wikipedia.org/wiki/68-95-99.7_rule
Hmm, so everything which is slower than μ + 3σ is unusual. We can use this to auto-tune the threshold for the slow-query log.
First we want to find the average query time over all queries and their standard deviation:
# 2007-09-14 16:53:56 [154] user: merlin, db: merlin
Query: "commit"
Norm_Query: "`commit` "
query_time: 104.60 ms
global(avg_query_time): 744.40 us
global(stddev_query_time): 10.93 ms
global(count): 29180
query(avg_query_time): 1311.59 us
query(stddev_query_time): 22.27 ms
query(count): 3164
(qtime > global-threshold)
Oops, a COMMIT takes a while. This one is 104ms, the average is 1.3ms, but as you see the stddev is pretty large. Still, this query was logged because its query time was larger than that of most other queries:
-- flag the query when we have seen enough queries and this one is
-- more than 3 standard deviations above the global average
if bl.avg.count > config.min_queries and
   inj.query_time > avg_query_time + 3 * stddev_query_time then
    o = o .. " (qtime > global-threshold)\n"
    log_query = true
end
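The avg_query_time and stddev_query_time feeding this check are cheap to maintain while the queries pass through: a counter, the sum of the query times and the sum of their squares are all we need. A minimal sketch of the idea in Lua (not the exact code from analyze-query.lua):

-- running statistics from count, sum and sum-of-squares
local stats = { count = 0, sum = 0, sum_sq = 0 }

function stats_add(t, query_time)
    t.count  = t.count + 1
    t.sum    = t.sum + query_time
    t.sum_sq = t.sum_sq + query_time * query_time
end

function stats_avg(t)
    return t.sum / t.count
end

function stats_stddev(t)
    -- population standard deviation: sqrt(E[x^2] - E[x]^2)
    local avg = stats_avg(t)
    return math.sqrt(t.sum_sq / t.count - avg * avg)
end

Keep one of those stat-sets globally and one per normalized query and you get the global(...) and query(...) lines from the log output above.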
Auto PROCESSLIST
A common problem for a slow query is not the query itself, but concurrency: other queries running in parallel against the same data. Lock contention, pressure on the caches, being IO-bound, ...
To track those problems more easily, you can tell the script to append a SHOW FULL PROCESSLIST when it finds a slow query and put its output into the logfile too:
mysql> SET GLOBAL analyze_query.auto_processlist = 1;
# 2007-09-14 16:15:35 [119] user: merlin, db: merlin
Query: "SELECT 1"
Norm_Query: "SELECT ? "
query_time: 15.57 ms
global(avg_query_time): 704.78 us
global(stddev_query_time): 2456.16 us
global(count): 829
query(avg_query_time): 331.45 us
query(stddev_query_time): 1230.25 us
query(count): 165
(global)
(query)
SHOW FULL PROCESSLIST
[0]{ 35, "merlin", "localhost:28907", NULL, "Sleep", 35, "", NULL }
[1]{ 36, "merlin", "localhost", NULL, "Sleep", 2280, "", NULL }
[2]{ 118, "merlin", "localhost:19290", "merlin", "Sleep", 2, "", NULL }
[3]{ 119, "merlin", "localhost:19291", "merlin", "Query", 0, NULL, "SHOW FULL PROCESSLIST" }
[4]{ 120, "merlin", "localhost:19292", "merlin", "Sleep", 2, "", NULL }
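Under the hood this is the usual proxy trick: when the script decides in read_query_result() to log a query, it injects a follow-up query into the same connection and swallows its resultset before it reaches the client. Roughly sketched (simplified, not the exact code from analyze-query.lua):

-- after the slow query was logged, inject our own follow-up query;
-- the id 2 marks it as the processlist query on the way back
if log_query and proxy.global.config.analyze_query.auto_processlist then
    proxy.queries:append(2, string.char(proxy.COM_QUERY) .. "SHOW FULL PROCESSLIST")
end

-- later, in read_query_result(), dump the rows into the log
-- and keep them away from the client
if inj.id == 2 then
    for row in inj.resultset.rows do
        -- columns: Id, User, Host, db, Command, Time, State, Info
        print(("  { %s, %s, %s, ... }"):format(
            tostring(row[1]), tostring(row[2]), tostring(row[5])))
    end
    return proxy.PROXY_IGNORE_RESULT
end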
Auto EXPLAIN
Or how about an EXPLAIN against the SELECT query, just in case?
mysql> SET GLOBAL analyze_query.auto_explain = 1;
# 2007-09-14 16:15:35 [119] user: merlin, db: merlin
Query: "SELECT 1"
Norm_Query: "SELECT ? "
query_time: 15.57 ms
global(avg_query_time): 704.78 us
global(stddev_query_time): 2456.16 us
global(count): 829
query(avg_query_time): 331.45 us
query(stddev_query_time): 1230.25 us
query(count): 165
EXPLAIN <query>
[0]{ 1, "SIMPLE", NULL, NULL, NULL, NULL, NULL, NULL, NULL, "No tables used" }
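The EXPLAIN works the same way, the only extra step is prepending EXPLAIN to the query text we just saw. Sketched with the same simplifications as above:

-- inj.query still carries the COM_QUERY byte in front, strip it
local query = inj.query:sub(2)

if log_query and config.auto_explain and query:upper():match("^%s*SELECT") then
    proxy.queries:append(3, string.char(proxy.COM_QUERY) .. "EXPLAIN " .. query)
end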
Query Histogram
As this is based on the Query Histogram we talked about earlier, you can also track the histogram yourself while the data is passing through. All times are in microseconds, as always:
root@127.0.0.1:4040 [merlin]> select * from histogram.queries;
+---------------------------------------------+----------+------------+---------------+
| query | COUNT(*) | AVG(qtime) | STDDEV(qtime) |
+---------------------------------------------+----------+------------+---------------+
| INSERT INTO `monitor_history_times` VALU... | 12 | 985.67 | 1599.54 |
| SELECT ( ? > ? ) | 12 | 186.00 | 14.32 |
| SELECT COUNT( `schedule_id` ) FROM `moni... | 68 | 574.85 | 201.27 |
| SELECT * FROM `dc_schedule` WHERE `serve... | 22 | 1349.23 | 47.98 |
| SELECT ( ( ? > ? ) && ( ( ( ? / ? ) * ? ... | 60 | 251.02 | 63.76 |
| SELECT SQL_CALC_FOUND_ROWS `monitor_hist... | 69 | 22201.77 | 108515.71 |
| `commit` | 3560 | 1227.78 | 21020.70 |
...
Looks like the SELECT SQL_CALC_FOUND_ROWS takes a while (22ms). On the other hand we have a bunch of queries here which run at network-latency speed (186us for a SELECT 2 > 1).
But what about the COMMIT statements? (Yes, I'm running with innodb_flush_log_at_trx_commit = 2.)
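By the way, histogram.queries is not a real table: the proxy intercepts the SELECT and answers it from its own Lua state. With the proxy API such a resultset can be built like this (a sketch; query_stats is a made-up name for the per-query stat-sets, formatted with the stats_avg()/stats_stddev() helpers from the first sketch):

-- inside read_query(): answer the SELECT ourselves,
-- the backend never sees this query
local rows = {}
for norm_query, s in pairs(proxy.global.query_stats) do
    rows[#rows + 1] = {
        norm_query,
        s.count,
        ("%.2f"):format(stats_avg(s)),
        ("%.2f"):format(stats_stddev(s))
    }
end

proxy.response = {
    type = proxy.MYSQLD_PACKET_OK,
    resultset = {
        fields = {
            { type = proxy.MYSQL_TYPE_STRING, name = "query" },
            { type = proxy.MYSQL_TYPE_LONG,   name = "COUNT(*)" },
            { type = proxy.MYSQL_TYPE_STRING, name = "AVG(qtime)" },
            { type = proxy.MYSQL_TYPE_STRING, name = "STDDEV(qtime)" }
        },
        rows = rows
    }
}
return proxy.PROXY_SEND_RESULT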
Script
The script we talk about here is already in SVN, and if you are running the 0.6.0 version you can update your packaged script with a simple:
$ wget http://svn.mysql.com/svnpublic/mysql-proxy/trunk/lib/analyze-query.lua
It uses the auto-config module to expose all the config options for runtime configuration:
proxy.global.config.analyze_query = {
    analyze_queries    = true,  -- track all queries
    auto_explain       = false, -- execute an EXPLAIN on SELECT queries after we logged
    auto_processlist   = false, -- execute a SHOW FULL PROCESSLIST after we logged
    min_queries        = 1000,  -- start logging after evaluating 1000 queries
    min_query_time     = 1000,  -- don't log queries taking less than 1ms
    query_cutoff       = 160,   -- only show the first 160 chars of the query
    log_slower_queries = false  -- log the query if it is slower than the queries of the same kind
}
You can use:
SET GLOBAL analyze_query.analyze_queries = 0
to disable the logging, bypass all the code and reduce the latency. As soon as you need it again, you just set it back to 1.
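In the script this maps to an early bail-out at the top of the query handler, roughly:

function read_query(packet)
    -- analysis is off: leave right away, the packet passes through
    -- untouched and the tracking code adds no latency
    if not proxy.global.config.analyze_query.analyze_queries then
        return
    end
    -- ... the normal tracking path ...
end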
- min_queries is used to reduce the statistical error: reporting about slow queries only starts after 1000 queries have been executed.
- query_cutoff is only interesting for the internal SELECT and limits the characters we print in the query column.
- log_slower_queries adds some more logging if a query is slower than the other queries of its kind (see the sketch below).
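The check behind log_slower_queries is the same 3-sigma rule as the global one, just fed with the per-query statistics; compare the (global)/(query) markers in the log output above. Sketched, with assumed variable names:

-- per-query variant of the 3-sigma check; the query_* values are
-- assumed names for the stats of the same normalized query
if config.log_slower_queries and
   inj.query_time > query_avg_query_time + 3 * query_stddev_query_time then
    o = o .. " (qtime > query-threshold)\n"
    log_query = true
end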