The problem of how to handle trees in SQL has been talked about alot. The basic 3 ways are:

store the full path for each entry
store the parent for each node
use nested tree

Nested tree is good for read-many-write-less applications where the tree doesn't check over time too much as a write-operation is heavy-weight most of the time.

Referencing the Parent through the full path

Using the variant of the path involves a lot of string handling and is always slow. See below:

# use a full path to each node
CREATE TABLE tree_path (
  node_path VARCHAR(1024) PRIMARY KEY,
  name VARCHAR(32) NOT NULL,
  INDEX (name)
) ENGINE = innodb;

INSERT INTO tree_path VALUES
  ( '/0',     'Earth' ),
  ( '/0/0',   'Europe' ),
  ( '/0/0/1', 'Germany' ),
  ( '/1',     'Moon' ),
  ( '/0/1',   'Asia' );

# search for parent of 'Asia'
SELECT t1.name
  FROM tree_path AS t1
 WHERE t1.node_path = (
   SELECT LEFT(t2.node_path,
            LENGTH(t2.node_path)
            - LENGTH(SUBSTRING_INDEX(t2.node_path, '/', -1))
            - 1
          )
     FROM tree_path AS t2
    WHERE t2.name = 'Asia');
+-------+
| name  |
+-------+
| Earth |
+-------+

# get all entries below Earth
SELECT t2.name
  FROM tree_path AS t1
    JOIN tree_path as t2
 WHERE t1.node_path = LEFT(t2.node_path, LENGTH(t1.node_path))
   AND t1.node_path < t2.node_path
   AND t1.name = 'Earth';
+---------+
| name    |
+---------+
| Europe  |
| Germany |
| Asia    |
+---------+

Referencing the Parent by ID

The self referencing variant which is using can't handle all jobs in the DB.

# use a full path to each node
CREATE TABLE tree_id (
  node_id INT UNSIGNED NOT NULL PRIMARY KEY,
  name VARCHAR(32) NOT NULL,
  parent_id INT UNSIGNED,
  FOREIGN KEY (parent_id) REFERENCES tree_id (node_id),
  INDEX (name)
) ENGINE = innodb;

INSERT INTO tree_id VALUES
  ( 1, 'Earth',   NULL ),
  ( 2, 'Europe',  1  ),
  ( 3, 'Germany', 2 ),
  ( 4, 'Moon',    NULL ),
  ( 5, 'Asia',    1 );

# search for parent of 'Asia'
SELECT t1.name
  FROM tree_id AS t1
 WHERE t1.node_id = (
   SELECT t2.parent_id
     FROM tree_id AS t2
    WHERE t2.name = 'Asia');
+-------+
| name  |
+-------+
| Earth |
+-------+
1 row in set (0.00 sec)

# get all entries below Earth
## we can't do this in one generic query

With the help of Store Procedures we can make it possible as SP provide recursion what is necessary here.

fetch the first level of childs
fetch the first level of childs for each childs
... and so on until we don't have any new childs

Check this:

DELIMTER $$
DROP PROCEDURE get_childs_sub$$
CREATE PROCEDURE get_childs_sub (IN nid INT)
BEGIN
  DECLARE n INT;
  DECLARE done INT DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT node_id FROM tree_id WHERE parent_id = nid;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  /* if we have not children the FETCH will fail with NOT FOUND */
  OPEN cur;
get_childs_fetch_loop: LOOP
    FETCH cur INTO n;

    IF done = 1 THEN
      LEAVE get_childs_fetch_loop;
    END IF;
    INSERT INTO __childs VALUES ( n );

    CALL get_childs_sub(n);

  END LOOP get_childs_fetch_loop;
  CLOSE cur;
END$$

### DROP TABLE would result in a problem here.
###
###
DROP PROCEDURE get_childs$$
CREATE PROCEDURE get_childs (IN nid INT)
BEGIN
  DECLARE n INT;

  DROP TEMPORARY TABLE IF EXISTS __childs;
  CREATE TEMPORARY TABLE __childs (
    node_id INT NOT NULL PRIMARY KEY
  );

  /* check if there really are sub-nodes */
  SELECT COUNT(*)
    FROM tree_id
   WHERE parent_id = nid INTO n;

  IF n <> 0 THEN
    /* fetch the node_ids of the childs
      into the __childs table */
    CALL get_childs_sub(nid);
  END IF;

  SELECT t1.*
    FROM tree_id AS t1
      JOIN __childs USING(node_id);

END$$
CALL  get_childs(1)$$
+---------+---------+-----------+
| node_id | name    | parent_id |
+---------+---------+-----------+
|       2 | Europe  |         1 |
|       3 | Germany |         2 |
|       5 | Asia    |         1 |
+---------+---------+-----------+
DELIMITER ;

Triggers

The triggers in MySQL 5.0.3 are quite limited, but we can already make some use out of them. The problem we want to handle is optimized search for domains, sub-domains and so on.

In our example want to concentrate on 'a user might have multiple email-addresses'.

CREATE TABLE user (
  user_id INT NOT NULL auto_increment PRIMARY KEY,
  name VARCHAR(32) NOT NULL
) engine = innodb;

CREATE TABLE email (
  email_id INT NOT NULL auto_increment PRIMARY KEY,
  user_id INT NOT NULL,
  address VARCHAR(128) NOT NULL,
  UNIQUE (address),
  FOREIGN KEY (user_id) REFERENCES user (user_id)
) engine = innodb;

INSERT INTO user VALUES
  ( NULL, 'Joe Dohn' ),
  ( NULL, 'Karl Heinze' ),
  ( NULL, 'Jet Wong' );

INSERT INTO email (user_id, address) VALUES
  ( 1, 'joe@yahoo.com' ),
  ( 1, 'joe@hotmail.com' ),
  ( 2, 'karl@gmx.de' ),
  ( 2, 'karl@bt.co.uk' ),
  ( 3, 'jet@hotmail.cn' );

As a basic operation we want to check which user has a account at in .uk domain.

SELECT u.name
  FROM user AS u
 WHERE u.user_id IN (
   SELECT user_id
     FROM email
    WHERE address LIKE '%.uk');

This is slow if you have a long list of email addresses. LIKE can only be optimized if the percent-sign is at the end.

REVERSE() can handle this for us at INSERT and UPDATE time. But we want to keep this transparent for the user.

ALTER TABLE email
  ADD reverse_address VARCHAR(128) NOT NULL;
UPDATE email SET reverse_address = REVERSE(address);
ALTER TABLE email
  ADD UNIQUE(reverse_address);

Whenever the user INSERTs something into the table the reverse_address field should be set directly without user-intervention. We need pre-INSERT trigger which replaces the default value by a REVERSE() of the address field.

DELIMITER $$
CREATE TRIGGER tbi_email_reverse_address
 BEFORE INSERT ON email FOR EACH ROW
BEGIN
  IF NEW.address IS NOT NULL THEN
    SET NEW.reverse_address = REVERSE(NEW.address);
  END IF;
END$$
DELIMITER ;

Let's insert a new email address for Karl and see if it works.

SELECT *
  FROM email;
+----------+---------+-----------------+-----------------+
| email_id | user_id | address         | reverse_address |
+----------+---------+-----------------+-----------------+
|        1 |       1 | joe@yahoo.com   | moc.oohay@eoj   |
|        2 |       1 | joe@hotmail.com | moc.liamtoh@eoj |
|        3 |       2 | karl@gmx.de     | ed.xmg@lrak     |
|        4 |       2 | karl@bt.co.uk   | ku.oc.tb@lrak   |
|        5 |       3 | jet@hotmail.cn  | nc.liamtoh@tej  |
+----------+---------+-----------------+-----------------+
INSERT INTO email (user_id, address) VALUES
  (2, 'karl@web.de' );

SELECT *
  FROM email;
+----------+---------+-----------------+-----------------+
| email_id | user_id | address         | reverse_address |
+----------+---------+-----------------+-----------------+
|        1 |       1 | joe@yahoo.com   | moc.oohay@eoj   |
|        2 |       1 | joe@hotmail.com | moc.liamtoh@eoj |
|        3 |       2 | karl@gmx.de     | ed.xmg@lrak     |
|        4 |       2 | karl@bt.co.uk   | ku.oc.tb@lrak   |
|        5 |       3 | jet@hotmail.cn  | nc.liamtoh@tej  |
|        6 |       2 | karl@web.de     | ed.bew@lrak     |
+----------+---------+-----------------+-----------------+

It works. The same is necessary for the update of the address field. The trigger does exactly the same.

DELIMITER $$
CREATE TRIGGER tbu_email_reverse_address
 BEFORE UPDATE ON email FOR EACH ROW
BEGIN
  IF NEW.address IS NOT NULL THEN
    SET NEW.reverse_address = REVERSE(NEW.address);
  END IF;
END$$
DELIMITER ;

Does it work ?

UPDATE email
   SET address = 'karl@hotmail.com'
 WHERE email_id = 6;

SELECT *
  FROM email;
+----------+---------+------------------+------------------+
| email_id | user_id | address          | reverse_address  |
+----------+---------+------------------+------------------+
|        1 |       1 | joe@yahoo.com    | moc.oohay@eoj    |
|        2 |       1 | joe@hotmail.com  | moc.liamtoh@eoj  |
|        3 |       2 | karl@gmx.de      | ed.xmg@lrak      |
|        4 |       2 | karl@bt.co.uk    | ku.oc.tb@lrak    |
|        5 |       3 | jet@hotmail.cn   | nc.liamtoh@tej   |
|        6 |       2 | karl@hotmail.com | moc.liamtoh@lrak |
+----------+---------+------------------+------------------+

It does. Now we have this automagically generated field. The user hasn't been told about it.

DELIMITER $$
CREATE PROCEDURE sp_get_email_by_domain ( IN adr VARCHAR(128) )
BEGIN
  SELECT *
    FROM email
   WHERE reverse_address LIKE REVERSE(CONCAT('%',adr));
END$$
DELIMITER ;

CALL sp_get_email_by_domain('.uk');
+----------+---------+---------------+-----------------+
| email_id | user_id | address       | reverse_address |
+----------+---------+---------------+-----------------+
|        4 |       2 | karl@bt.co.uk | ku.oc.tb@lrak   |
+----------+---------+---------------+-----------------+

DELIMITER $$
CREATE PROCEDURE sp_get_user_by_domain ( IN adr VARCHAR(128) )
BEGIN
  SELECT u.*
    FROM user AS u JOIN email USING (user_id)
   WHERE reverse_address LIKE REVERSE(CONCAT('%',adr));
END$$
DELIMITER ;

CALL sp_get_user_by_domain('.uk');
+---------+-------------+
| user_id | name        |
+---------+-------------+
|       2 | Karl Heinze |
+---------+-------------+

As the REVERSE() puts the percent-sign in front the query results in a range-request which is compared to the table-scan very fast.

Function and Triggers with DML

DELIMITER $$
DROP FUNCTION get_child_name$$
CREATE FUNCTION get_child_name (id INT)
  RETURNS VARCHAR(128)
BEGIN
  DECLARE a VARCHAR(128);
  SELECT name FROM tree_id WHERE node_id = id INTO a;
  RETURN a;
END$$
SELECT get_child_name(1) AS name$$
+-------+
| name  |
+-------+
| Earth |
+-------+
DELIMITER ;

Is the same true for DML statements in triggers ?

DELIMITER $$
DROP TABLE IF EXISTS log_updates$$
CREATE TABLE log_updates (
  tbl_name VARCHAR(64) NOT NULL,
  row_id BIGINT NOT NULL,
  user VARCHAR(64) NOT NULL,
  mtime DATETIME NOT NULL
) engine = innodb$$
DROP TRIGGER user.tau_log_updates$$
CREATE TRIGGER tau_log_updates
  AFTER UPDATE ON user FOR EACH ROW
BEGIN
  DECLARE a VARCHAR(128);

  INSERT INTO log_updates
    ( tbl_name, row_id, user, mtime) VALUES
    ( "user", OLD.user_id, CURRENT_USER, CURRENT_TIMESTAMP);
END$$
DELIMITER ;

SET @@autocommit = 0;
BEGIN;
## explicit locking for the trigger
## this is currently necessary
LOCK TABLE log_updates WRITE, user WRITE, mysql.proc READ;
UPDATE user SET name = 'Karl Heise' WHERE user_id = 2;
SHOW WARNINGS;
SELECT * FROM log_updates;
# +----------+--------+----------------+---------------------+
# | tbl_name | row_id | user           | mtime               |
# +----------+--------+----------------+---------------------+
# | user     |      2 | root@localhost | 2005-04-02 14:51:32 |
# +----------+--------+----------------+---------------------+
ROLLBACK;
UNLOCK TABLES;
SELECT * FROM log_updates;
# empty set

All the tables are transactional here and as you see the trigger works and is rollback as expected. Fine.

Porting world-db to trees

To test the recursive SPs we take the world-db from mysql documentation and take geographic categories as levels in the tree.

planet
continent
region
country
district
city

As the original structure is quite flat, let's blow it up a little bit :).

DROP TABLE IF EXISTS world_tree;
CREATE TABLE world_tree (
  id INT UNSIGNED NOT NULL auto_increment PRIMARY KEY,
  name VARCHAR(128),
  type ENUM('planet', 'continent', 'region', 'country', 'district', 'city') NOT NULL,
  parent_id INT UNSIGNED,
  INDEX (parent_id)
) engine=myisam;
# insert the base-node
INSERT INTO world_tree
  VALUES ( 1, 'Earth', 'planet', NULL );
# feed the continents
INSERT INTO world_tree
  SELECT DISTINCT NULL, continent, 'continent',
         (SELECT id
            FROM world_tree
           WHERE name = 'Earth')
    FROM Country;
# regions
INSERT INTO world_tree
  SELECT DISTINCT NULL, c.region,  'region',
         (SELECT id
            FROM world_tree AS t
           WHERE t.name = c.continent AND t.type = 'continent' ) as parent_id
    FROM Country AS c;
# countries
INSERT INTO world_tree
  SELECT NULL, c.name, 'country',
         (SELECT id
            FROM world_tree AS t
           WHERE t.name = c.region AND t.type = 'region') as parent_id
    FROM Country AS c;
# fetch districts
INSERT INTO world_tree
  SELECT DISTINCT NULL, c.district, 'district',
         (SELECT id
            FROM world_tree AS t
              JOIN Country AS co
                ON co.name = t.name
           WHERE co.code = c.countryCode AND t.type = 'country') as parent_id
    FROM City AS c WHERE c.district = 'Gaza';
# cities
INSERT INTO world_tree
  SELECT NULL, c.name, 'city',
         (SELECT t.id
            FROM world_tree AS t
              JOIN world_tree AS t2
                ON t2.id = t.parent_id
              JOIN Country as co
                ON t2.name = co.name
           WHERE t.name = c.district
             AND t.type = 'district'
             AND co.code = c.countrycode ) as parent_id
    FROM City AS c;

Ok, let's start with some simple queries on the tree:

# how deep is the tree
DROP PROCEDURE IF EXISTS sp_tree_depth_sub;
DELIMITER $$
CREATE PROCEDURE sp_tree_depth_sub(IN nid INT, OUT ndepth INT UNSIGNED)
BEGIN
  DECLARE depth INT UNSIGNED DEFAULT 0;
  DECLARE n INT;
  DECLARE maxdepth INT UNSIGNED DEFAULT 0;
  DECLARE done INT DEFAULT 0;
  DECLARE cur CURSOR FOR SELECT id FROM world_tree WHERE parent_id = nid;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN cur;
loopy: LOOP
    FETCH cur INTO n;

    IF done = 1 THEN
      LEAVE loopy;
    END IF;
    CALL sp_tree_depth_sub(n, depth);
    SET depth = depth + 1;
    IF depth > maxdepth THEN
      SET maxdepth = depth;
    END IF;
  END LOOP loopy;
  CLOSE cur;
  SET ndepth = maxdepth;
END$$
DROP PROCEDURE IF EXISTS sp_tree_depth$$
CREATE PROCEDURE sp_tree_depth(IN nid INT, OUT ndepth INT UNSIGNED)
BEGIN
  DECLARE n INT;
  SELECT id FROM world_tree LIMIT 0 INTO n;
  CALL sp_tree_depth_sub(nid, ndepth);
END$$
CALL sp_tree_depth(1, @a)$$
SELECT @a$$
DELIMITER ;

Getting the Depth of a Tree II

The last attempt on this topic resulted in a quite slow SP. It walked the tree and asked every node for its node and walked them too. This is good for a memory-based tree but bad for a SQL storage.

SQL can do JOINs, let's use them for this job.

# Let's assume this setup of the tree
1
  2
     5
     6
  3
     9
     8
        4

The idea is now to join from left to right and count how long you can join

First round:
1 - 2
1 - 3

Second round:
2 - 5
2 - 6
3 - 9
3 - 8

Third round:
8 - 4

Ok, 3 levels below the 1.

DROP PROCEDURE IF EXISTS sp_tree_depth2_sub;
DELIMITER $$
CREATE PROCEDURE sp_tree_depth2_sub (INOUT depth INT)
BEGIN
  DECLARE n INT DEFAULT 0;
  delete from __b;
  SET depth = depth + 1;
loopy: LOOP
    insert into __b
      select t.id
        from world_tree AS t
          JOIN __a AS t1
            ON t1.id = t.parent_id;

    delete from __a;
    insert into __a select * from __b;
    delete from __b;

    SELECT count(*) from __a INTO n;
    IF n = 0 THEN
      LEAVE loopy;
    END IF;
    SET depth = depth + 1;
  END LOOP loopy;
END$$
DROP PROCEDURE IF EXISTS sp_tree_depth2$$
CREATE PROCEDURE sp_tree_depth2 (IN nid INT, OUT odepth INT)
BEGIN
  DECLARE depth INT DEFAULT 0;

  insert into __a select t.id FROM world_tree AS t WHERE t.parent_id = nid;
  CALL sp_tree_depth2_sub (depth);

  SET odepth = depth;
END$$
DELIMITER ;
CREATE TABLE __a (id int);
CREATE TABLE __b LIKE __a;
CALL sp_tree_depth2(1, @a);
# Query OK, 0 rows affected (0.45 sec)
DROP TABLE __a;
DROP TABLE __b;
SELECT @a;

Yeaha, down to a part of a second. Just a little bit of mess with pre-creating tables because of a bug mentioned above. Otherwise the CREATE TABLE would be a CREATE TEMPORARY TABLE and it would be in the SP itself.

Rethinking algorithms always wins.

A flexible tree class using Dynamic SQL I promised this set of functions back in April 2005 at the MySQL UC in Santa Clara where I talked about the features you see above.

Starting with 5.0.13 the ground work for dynamic SQL was laid out as it was possible to use PREPARE inside a stored procedure.

DELIMITER $$
DROP PROCEDURE IF EXISTS sp_tree_depth$$
CREATE PROCEDURE sp_tree_depth (IN tblname VARCHAR(64), IN nid INT, OUT odepth INT)
BEGIN
  ...

  SET @stmt := CONCAT('insert into __a select t.id FROM ', tblname, ' AS t WHERE t.parent_id = ', nid);
  PREPARE stmt_sp_tree_depth FROM @stmt;

  ...

The PREPARE statement can take user variable as input and we can use it to generate SQL statements on the fly, hence the name dynamic SQL. We need this as we don't want to hard-code the table names into the SPs.

I prepared a set of SPs to work on tree-structures which have this setup:

id is INT, is the node-id and is the PRIMARY KEY of the table
parent_id is a FOREIGN KEY to id in the same table

Part of the tree-SPs are the folloing functions: - sp_tree_depth - sp_tree_get_subnodes - sp_tree_get_parent - sp_tree_delete_subnodes - sp_tree_delete_node - sp_tree_move_subnodes - sp_tree_move_node

Check out the source file about the parameter to the SPs and how to use them. They are well documented.

Adding a new node was left out as we don't know the structure of the table.

To use the functions check out the test-file provided below:

-- get depth

CALL sp_tree_depth('world_tree', 1, @a);
SELECT @a AS 'sublevels below Earth';

-- get all subnodes

CALL sp_tree_get_subnodes('world_tree', 210, 'res_tbl');
SELECT COUNT(*) AS 'subnodes of Germany' FROM res_tbl;
DROP TEMPORARY TABLE res_tbl;

-- get the parent

CALL sp_tree_get_parent('world_tree', 210, @id);
SELECT @id as 'parent_id of Germany';

-- delete all subnodes

BEGIN;
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany before delete';
CALL sp_tree_delete_subnodes('world_tree', 210);
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany after delete';
CALL sp_tree_get_subnodes('world_tree', 210, 'res_tbl');
SELECT COUNT(*) AS 'nodes of subtree Germany' FROM res_tbl;
DROP TEMPORARY TABLE res_tbl;
ROLLBACK;
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany rolled back';

-- delete node

BEGIN;
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany before delete';

CALL sp_tree_delete_subnodes('world_tree', 210);
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany after delete';
CALL sp_tree_get_subnodes('world_tree', 210, 'res_tbl');
SELECT COUNT(*) AS 'nodes of subtree Germany' FROM res_tbl;
DROP TEMPORARY TABLE res_tbl;

CALL sp_tree_delete_node('world_tree', 210);
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany after node-delete';

ROLLBACK;
CALL sp_tree_depth('world_tree', 210, @a);
SELECT @a AS 'tree depth of subtree Germany rolled back';

I have made up 3 files - the stored procedures itself sp_tree.sql - the conversion script to turn the world database into a tree structure - the tests which use the tree SPs on the world_tree table.

All the files are released under the MIT License with a Copyright in 2005, Jan Kneschke, meaning you are free to use it where ever you like as long as you include the copyright and the don't expect that this works.

Trees in SQL