
If you read the MySQL manual you might have seen ORDER BY RAND() to randomize the rows and LIMIT 1 to take just one of them.

SELECT name
  FROM random
 ORDER BY RAND()
 LIMIT 1;

This example works fine and is fast if you only have, let's say, 1000 rows. As soon as you have 10000 rows the overhead for sorting the rows becomes important. Don't forget: we only sort to throw nearly all of the rows away.

I never liked it. And there are better ways to do it. Without sorting. As long as we have a numeric primary key.

For the first examples we assume the ID starts at 1 and that there are no holes between 1 and the maximum value of the ID.
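
The examples run against a table called random; a minimal sketch of a matching schema (the exact columns are an assumption, only the numeric primary key matters) could look like this:

# hypothetical test table: numeric AUTO_INCREMENT primary key starting at 1,
# plus one payload column; fill it with as many dummy rows as you like
CREATE TABLE random (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(32)  NOT NULL
);

INSERT INTO random (name) VALUES (MD5(RAND())), (MD5(RAND())), (MD5(RAND()));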

move the work into the application

First idea: We can simplify the whole job if we calculate the ID beforehand in the application.

SELECT MAX(id) FROM random;
## generate random id in application
SELECT name FROM random WHERE id = <random-id>

As MAX(id) == COUNT(id), we just generate a random number between 1 and MAX(id) and pass it into the database to retrieve the random row.

The first SELECT is a NO-OP and is optimized away. The second is an eq_ref against a constant value and also very fast.
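
If you want to try this idea from the mysql prompt without writing application code, user variables can stand in for the application side (just a sketch, not how a real application would do it):

# sketch: simulate the application step with user variables
SELECT MAX(id) INTO @max_id FROM random;
SET @rand_id = FLOOR(1 + RAND() * @max_id);  # the "application" part
SELECT name FROM random WHERE id = @rand_id;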

move the job into the database

But is it really necessary to do it in the application? Can't we do it in the database?

# generating a random ID
> SELECT RAND() * MAX(id) FROM random;
+------------------+
| RAND() * MAX(id) |
+------------------+
|  689.37582507297 |
+------------------+
# oops, this is a double, we need an int

> SELECT CEIL(RAND() * MAX(id)) FROM random;
+------------------------+
| CEIL(RAND() * MAX(id)) |
+------------------------+
|                1000000 |
+------------------------+
# better. But how is the performance:

> EXPLAIN 
   SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+--------+-------+---------+-------------+
| id | select_type | table  | type  | rows    | Extra       |
+----+-------------+--------+-------+---------+-------------+
|  1 | SIMPLE      | random | index | 1000000 | Using index |
+----+-------------+--------+-------+---------+-------------+
## an index scan? we lost our optimization for the MAX()

> EXPLAIN 
   SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra                        |
+----+-------------+-------+------+------+------------------------------+
|  1 | PRIMARY     | NULL  | NULL | NULL | No tables used               |
|  2 | SUBQUERY    | NULL  | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## a simple sub-query brings the performance back.

Ok, now we know how to generate the random ID, but how do we get the row?

> EXPLAIN
SELECT name
  FROM random
 WHERE id = (SELECT CEIL(RAND() *
                         (SELECT MAX(id)
                            FROM random)));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows    | Extra                        |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
|  1 | PRIMARY     | random | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where                  |
|  3 | SUBQUERY    | NULL   | NULL | NULL          | NULL | NULL    | NULL |    NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message                                  |
+-------+------+------------------------------------------+
| Note  | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+

NO, NO, NO. Don't go this way. This is the most obvious, but also the most wrong way to do it. The reason: the SELECT in the WHERE clause is executed for every row the outer SELECT fetches. This leads to 0 to 4091 rows, depending on your luck.
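
You can watch the effect with a quick check (just a sketch; the exact counts depend on chance and on your data):

# run this a few times: the number of matching rows varies from run to run,
# because RAND() is re-evaluated for every row the outer SELECT looks at
SELECT COUNT(*)
  FROM random
 WHERE id = (SELECT CEIL(RAND() * (SELECT MAX(id) FROM random)));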

We need a way to make sure that the random-id is only generated once:

SELECT name
  FROM random JOIN
       (SELECT CEIL(RAND() *
                    (SELECT MAX(id)
                       FROM random)) AS id
        ) AS r2
       USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | random     | const  |    1 |                              |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

The inner SELECT is generating a constant TEMPORARY table and the JOIN is selecting just one single row. Perfect.

No sorting, no application, most parts of the query optimized away.

adding holes to the numbers

To generalize the last solution we add the possibility of holes, like when you DELETE rows.

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table      | type   | rows | Extra                        |
+----+-------------+------------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | system |    1 |                              |
|  1 | PRIMARY     | r1         | range  |  689 | Using where                  |
|  2 | DERIVED     | NULL       | NULL   | NULL | No tables used               |
|  3 | SUBQUERY    | NULL       | NULL   | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+

The JOIN now adds all the IDs which are greater than or equal to our random value and selects only the direct neighbour if a direct match is not possible. BUT as soon as one row is found, we stop (the LIMIT 1). And we read the rows according to the index (ORDER BY id ASC). As we are using >= instead of =, we can get rid of the CEIL and get the same result with a little less work.

Equal Distribution

As soon as the distribution of the IDs is not equal anymore, our selection of rows isn't really random either.

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      107 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |       50 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      132 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      481 |
+----+----------------------------------+----------+

The RAND function generates values like 9 to 15, which all lead to the row with id 16 being selected as the next higher number.

There is no real solution for this problem, but if your data is mostly constant you can add a mapping table which maps the row-number to the id:

> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
|      1 |         1 |
|      2 |         2 |
|      3 |         4 |
|      4 |         8 |
|      5 |        16 |
+--------+-----------+

The row_id is now free of holes and we can run our random query again:

SELECT name FROM holes
  JOIN (SELECT r1.random_id
         FROM holes_map AS r1
         JOIN (SELECT (RAND() *
                      (SELECT MAX(row_id)
                         FROM holes_map)) AS row_id)
               AS r2
        WHERE r1.row_id >= r2.row_id
        ORDER BY r1.row_id ASC
        LIMIT 1) as rows ON (id = random_id);

After 1000 fetches we see an equal distribution again:

> select * from holes;
+----+----------------------------------+----------+
| id | name                             | accesses |
+----+----------------------------------+----------+
|  1 | d12b2551c6cb7d7a64e40221569a8571 |      222 |
|  2 | f82ad6f29c9a680d7873d1bef822e3e9 |      187 |
|  4 | 9da1ed7dbbdcc6ec90d6cb139521f14a |      195 |
|  8 | 677a196206d93cdf18c3744905b94f73 |      207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa |      189 |
+----+----------------------------------+----------+

Maintaining the holes

Let's take the tables as before:

DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
  id SERIAL,
  name VARCHAR(32) NOT NULL UNIQUE
);

DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
  id SERIAL,
  r2_id bigint unsigned NOT NULL UNIQUE
);

Whenever we change something in r2 we want r2_equi_dist to be updated too.

DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
 AFTER INSERT ON r2 FOR EACH ROW
BEGIN
  DECLARE m BIGINT UNSIGNED DEFAULT 1;

  SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
  SELECT IFNULL(m, 1) INTO m;
  INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;

DELETE FROM r2;

INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  2 | a09a3959d68187ce48f4fe7e388926a9 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     2 |
|  3 |     3 |
|  4 |     4 |
+----+-------+

INSERT is quite simple; on DELETE we have to update the equi-dist ids to maintain the hole-free setup:

DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
 AFTER DELETE ON r2 FOR EACH ROW
BEGIN
  DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
  UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;

DELETE FROM r2 WHERE id = 2;

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
|  4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |     4 |
+----+-------+

UPDATE is straightforward again. We only have to maintain the foreign-key relationship:

DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
 AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
  UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;

UPDATE r2 SET id = 25 WHERE id = 4;

SELECT * FROM r2;
+----+----------------------------------+
| id | name                             |
+----+----------------------------------+
|  1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
|  3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
|  1 |     1 |
|  2 |     3 |
|  3 |    25 |
+----+-------+

Multiple Rows at once

If you want to get more than one row returned, you can:

  • execute the query several times
  • write a stored procedure which executes the query and stores the results in a temp-table
  • make a UNION (a sketch follows right after this list)
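
A minimal sketch of the UNION variant, reusing the single-row query from above (UNION removes duplicate rows, so you may occasionally get back fewer rows than you asked for):

# two random rows via UNION; each branch is the single-row query from above
(SELECT r1.id, r1.name
   FROM random AS r1
   JOIN (SELECT (RAND() *
                      (SELECT MAX(id)
                         FROM random)) AS id) AS r2
  WHERE r1.id >= r2.id
  ORDER BY r1.id ASC
  LIMIT 1)
UNION
(SELECT r1.id, r1.name
   FROM random AS r1
   JOIN (SELECT (RAND() *
                      (SELECT MAX(id)
                         FROM random)) AS id) AS r2
  WHERE r1.id >= r2.id
  ORDER BY r1.id ASC
  LIMIT 1);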

a stored procedure

Stored procedures provide you with the structures you know from your favourite programming language:

  • loops
  • control structures
  • procedures
  • ...

For this task we only need a LOOP:

DELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS rands;
  CREATE TEMPORARY TABLE rands ( rand_id INT );

loop_me: LOOP
    IF cnt < 1 THEN
      LEAVE loop_me;
    END IF;

    INSERT INTO rands
       SELECT r1.id
         FROM random AS r1 JOIN
              (SELECT (RAND() *
                            (SELECT MAX(id)
                               FROM random)) AS id)
               AS r2
        WHERE r1.id >= r2.id
        ORDER BY r1.id ASC
        LIMIT 1;

    SET cnt = cnt - 1;
  END LOOP loop_me;
END$$
DELIMITER ;

CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
|  133716 |
|  702643 |
|  112066 |
|  452400 |
+---------+

I leave it to the reader to fix the remaining issues:

  • use dynamic SQL and pass in the name of the temporary table
  • use a UNIQUE index on the table and catch the UNIQUE key violation to remove possible duplicates from the result-set (a sketch for this follows below)
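
A possible sketch for the second item (get_rands_unique is a made-up name here): combine a UNIQUE index with INSERT IGNORE and only count down when a row was really inserted:

DELIMITER $$
DROP PROCEDURE IF EXISTS get_rands_unique$$
CREATE PROCEDURE get_rands_unique(IN cnt INT)
BEGIN
  DROP TEMPORARY TABLE IF EXISTS rands;
  # the UNIQUE index keeps duplicates out of the result-set
  CREATE TEMPORARY TABLE rands ( rand_id INT, UNIQUE KEY (rand_id) );

loop_me: LOOP
    IF cnt < 1 THEN
      LEAVE loop_me;
    END IF;

    # INSERT IGNORE silently skips a duplicate rand_id
    INSERT IGNORE INTO rands
       SELECT r1.id
         FROM random AS r1 JOIN
              (SELECT (RAND() *
                            (SELECT MAX(id)
                               FROM random)) AS id)
               AS r2
        WHERE r1.id >= r2.id
        ORDER BY r1.id ASC
        LIMIT 1;

    # only count down when a row was really inserted
    # (careful: this loops forever if cnt is larger than the number of rows)
    SET cnt = cnt - ROW_COUNT();
  END LOOP loop_me;
END$$
DELIMITER ;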

Performance

Now let's see what happens to our performance. We have 3 different queries for solving our problem:

  • Q1. ORDER BY RAND()
  • Q2. RAND() * MAX(ID)
  • Q3. RAND() * MAX(ID) + ORDER BY ID

Q1 is expected to cost N * log2(N), while Q2 and Q3 are nearly constant.

To get real values we filled the table with N rows (one hundred to one million) and executed each query 1000 times.

   100        1.000      10.000     100.000    1.000.000
Q1  0:00.718s  0:02.092s  0:18.684s  2:59.081s  58:20.000s
Q2  0:00.519s  0:00.607s  0:00.614s  0:00.628s   0:00.637s
Q3  0:00.570s  0:00.607s  0:00.614s  0:00.628s   0:00.637s

As you can see, the plain ORDER BY RAND() is already behind the optimized query with only 100 rows in the table.

A more detailed analysis of those queries is at analyzing-complex-queries.

