Using aggregates in Joins – MySQL

Recently, I jumped into an issue in my mysql query. It involved an inner join, group by clause, and I had to use aggregate functions to summarize my result. Little did I notice where I was landing into, until I saw the unexpected huge numbers appearing in my aggregate result. Let me expain!

Suppose we have a users, posts, and post_comments.

users and posts have one-to-many relationship, i.e, a user can post any number of posts and a post belongs to a single user.

posts and post_comments too have one-to-many relationship, i.e, a post can have any number of post_comments and a post_comment belongs to a single post.

mysql> desc users;
+-------+--------------+------+-----+---------+----------------+
| Field | Type         | Null | Key | Default | Extra          |
+-------+--------------+------+-----+---------+----------------+
| id    | mediumint(9) | NO   | PRI | NULL    | auto_increment |
| name  | varchar(50)  | YES  |     | NULL    |                |
+-------+--------------+------+-----+---------+----------------+

mysql> desc posts;
+----------+--------------+------+-----+---------+----------------+

| Field    | Type         | Null | Key | Default | Extra          |

+----------+--------------+------+-----+---------+----------------+

| id       | mediumint(9) | NO   | PRI | NULL    | auto_increment |

| user_id  | mediumint(9) | YES  | MUL | NULL    |                |

| title    | varchar(50)  | YES  |     | NULL    |                |

| content  | text         | YES  |     | NULL    |                |

| no_words | int(11)      | YES  |     | NULL    |                |

+----------+--------------+------+-----+---------+----------------+


mysql> desc post_comments;

+-------------+--------------+------+-----+---------+----------------+

| Field       | Type         | Null | Key | Default | Extra          |

+-------------+--------------+------+-----+---------+----------------+

| id          | mediumint(9) | NO   | PRI | NULL    | auto_increment |

| post_id     | mediumint(9) | YES  | MUL | NULL    |                |

| comment     | text         | YES  |     | NULL    |                |

| post_rating | int(11)      | YES  |     | NULL    |                |

+-------------+--------------+------+-----+---------+----------------+

Assume, they have the following values.


mysql> select * from users;
+----+------+
| id | name |
+----+------+
|  1 | ABC  |
|  2 | DEF  |
+----+------+

mysql> select * from posts;
+----+---------+----------------+------------------------------------------------------------+----------+
| id | user_id | title          | content                                                    | no_words |
+----+---------+----------------+------------------------------------------------------------+----------+
|  1 |       1 | Welcome Post   | This is an introductory post                               |       27 |
|  2 |       1 | MySQL          | MySQL is the most popular open source database             |       46 |
|  3 |       2 | Basic commands | CREATE, INSERT, UPDATE, DELETE, DESTROY... to mention some |       58 |
+----+---------+----------------+------------------------------------------------------------+----------+

mysql> select * from post_comments;
+----+---------+--------------------------------------------------+-------------+
| id | post_id | comment                                          | post_rating |
+----+---------+--------------------------------------------------+-------------+
|  1 |       1 | You ve got an awesome blog!                      |          10 |
|  2 |       1 | Keep writing more such stuffs !                  |           8 |
|  3 |       2 | I just got my hand on this! Thanks for the intro |           7 |
|  4 |       2 | We already have Sql, why MySQL?                  |           4 |
|  5 |       3 | These commands work like a charm! :)             |           8 |
+----+---------+--------------------------------------------------+-------------+

Now, we need to find out:

For each user, no. of posts, no. of comments, total rating received, total no_words written.

For this, we would join the posts and the post_comments table, group by user_id in the posts table, and use aggregate functions to find the sum and count. Let’s shoot! (ASSUME THIS AS OUR BASE QUERY FOR ALL FURTHER DISCUSSIONS)

 

mysql> select P.user_id, count(P.id) no_posts, count(PC.id) no_comments, sum(post_rating), sum(no_words) 
from post_comments PC inner join posts P on P.id = PC.post_id group by P.user_id;
+---------+----------+-------------+------------------+---------------+
| user_id | no_posts | no_comments | sum(post_rating) | sum(no_words) |
+---------+----------+-------------+------------------+---------------+
|       1 |        4 |           4 |               29 |           146 |
|       2 |        1 |           1 |                8 |            58 |
+---------+----------+-------------+------------------+---------------+
2 rows in set (0.00 sec)

 

As you can see, we have joined posts and post_comments table by post_id, grouped by user_id, and used aggregate functions. Let’s examine our result now.

According to the first row, User id 1 has:

1. written 4 posts – Incorrect!!! User 1 has written just 2 posts, with id as 1 and 2 in post table!

2. got 4 comments – Correct! User 1 has 4 comments, 2 each for post_id 1 and 2. Check the post_comments table!

3. total post_rating of 29 – Correct! Sum all the ratings of the comments he has received. (10+8+7+4=29).

4. no_words written is 146 – Incorrect!!! Total no_words written by User id 1 is 73 (27 + 46) for post id 1 and 2 respectively.

So, we have 2 correct values and 2 incorrect values for each row!

If you would have noticed, the correct values, viz, no_comments and post_rating are from the post_comments table(child table), and the incorrect values, viz, no_posts and no_words are from the posts table…This is not a mere coincidence (I will come back to this in the later part of the post) !!!

Let’s understand the query we made. In the base query, if you remove the group by clause and the aggregates used in the Select clause, we can see how our result looked like, and on what basis the aggregates were done.

 

mysql> select P.user_id,  P.id post_id, PC.id comment_id, post_rating, no_words 
from post_comments PC inner join posts P on P.id = PC.post_id;
+---------+---------+------------+-------------+----------+
| user_id | post_id | comment_id | post_rating | no_words |
+---------+---------+------------+-------------+----------+
|       1 |       1 |          1 |          10 |       27 |
|       1 |       1 |          2 |           8 |       27 |
|       1 |       2 |          3 |           7 |       46 |
|       1 |       2 |          4 |           4 |       46 |
|       2 |       3 |          5 |           8 |       58 |
+---------+---------+------------+-------------+----------+
5 rows in set (0.01 sec)

What we have is a row for each post_comments, and the corresponding post_id, user_id and no_words used for the post. So, no record of the child table(post_comments) is duplicated as you can see, but records from the parent table(posts) have duplicates.

Hence, I mentioned, this is no mere coincidence. This brings us to a very important note, that you must always remember.

If Table A has one-to-many relation with Table B, and Table A is JOINED to Table B, then Table A will have duplicate rows in the result.

When we do a group by post id, what it does is groups all the records that have the same post id, and performs the aggregate functions on each of them.

Now, returning to our base query, we did count(P.id) to get the no_posts written by the user. count(column) returns the count for those rows whose column is not null, it does not check if it is the same post or not. So, user 1 has 4 rows in the above table, and each row has a value for P.id, i.e. P.id is not null. Hence, we got no_posts as 4 for User 1 in our result.

As you might have guessed, we can fix this by using a count(distinct P.id). Let’s check.

Modify our base query, to include distinct P.id

mysql> select P.user_id, count(distinct P.id) no_posts, count(PC.id) no_comments, sum(post_rating), 
sum(no_words) from post_comments PC inner join posts P on P.id = PC.post_id group by P.user_id;
+---------+----------+-------------+------------------+---------------+
| user_id | no_posts | no_comments | sum(post_rating) | sum(no_words) |
+---------+----------+-------------+------------------+---------------+
|       1 |        2 |           4 |               29 |           146 |
|       2 |        1 |           1 |                8 |            58 |
+---------+----------+-------------+------------------+---------------+

Yo, that worked! So, we have fixed the no_posts issue. We now have to fix the no_words sum. You might be wondering we can do the same SUM(distinct no_words). Let’s try doing that.

mysql> select P.user_id, count(distinct P.id) no_posts, count(PC.id) no_comments,sum(post_rating), 
sum(distinct no_words) from post_comments PC inner join posts P on P.id = PC.post_id 
group by P.user_id;
+---------+----------+-------------+------------------+------------------------+
| user_id | no_posts | no_comments | sum(post_rating) | sum(distinct no_words) |
+---------+----------+-------------+------------------+------------------------+
|       1 |        2 |           4 |               29 |                     73 |
|       2 |        1 |           1 |                8 |                     58 |
+---------+----------+-------------+------------------+------------------------+

Seems correct, huh? The number for no_words is as expected, 73. But, logically this is wrong. I’ll explain. Insert another post, which has no_words same as one of the existing post of user 1, and add a comment for it.

mysql> insert into posts (user_id,title,content,no_words) values (1,'Test post','This is a test post........',27);
mysql> insert into post_comments (post_id,comment,post_rating) values (4,'This is a test comment',10);

mysql> select * from posts;
+----+---------+----------------+------------------------------------------------------------+----------+
| id | user_id | title          | content                                                    | no_words |
+----+---------+----------------+------------------------------------------------------------+----------+
|  1 |       1 | Welcome Post   | This is an introductory post                               |       27 |
|  2 |       1 | MySQL          | MySQL is the most popular open source database             |       46 |
|  3 |       2 | Basic commands | CREATE, INSERT, UPDATE, DELETE, DESTROY... to mention some |       58 |
|  4 |       1 | Test post      | This is a test post........                                |       27 |
+----+---------+----------------+------------------------------------------------------------+----------+

mysql> select * from post_comments;
+----+---------+--------------------------------------------------+-------------+
| id | post_id | comment                                          | post_rating |
+----+---------+--------------------------------------------------+-------------+
|  1 |       1 | You ve got an awesome blog!                      |          10 |
|  2 |       1 | Keep writing more such stuffs !                  |           8 |
|  3 |       2 | I just got my hand on this! Thanks for the intro |           7 |
|  4 |       2 | We already have Sql, why MySQL?                  |           4 |
|  5 |       3 | These commands work like a charm! :)             |           8 |
|  6 |       4 | This is a test comment                           |          10 |
+----+---------+--------------------------------------------------+-------------
+

 The newly inserted post is by user 1 and has no_words as 27(same as post with id 1). Let’s do the query with Sum(distinct no_words) again, and verfiy our result.

mysql> select P.user_id, count(distinct P.id) no_posts, count(PC.id) no_comments, sum(post_rating), 
sum(distinct no_words) from post_comments PC inner join posts P on P.id = PC.post_id group by P.user_id;

+---------+----------+-------------+------------------+------------------------+
| user_id | no_posts | no_comments | sum(post_rating) | sum(distinct no_words) |
+---------+----------+-------------+------------------+------------------------+
|       1 |        3 |           5 |               39 |                     73 |
|       2 |        1 |           1 |                8 |                     58 |
+---------+----------+-------------+------------------+------------------------+

According to the result, User 1 now has:

3 posts – Correct

5 comments – Correct

39 total post_rating – Correct (10+8+7+4+10=39)

73 no_words – INCORRECT!!!! – it should have been 27+46+27 = 100.

So, SUM(distinct no_words) will not work for us. But why? This is a very frequent mistake most of us do. Sum(distinct no_words) would remove any duplicate values, i.e., 27+46+27 will become 27+46. Since 27 is repeated twice. But they belong to two different posts, Sum(distinct) does not consider this fact.

Let’s now move on to our final solution. I’m sure there must be many workarounds, but one of which I learnt to use is using derived tables. (Please feel free to comment on improvised solutions).

Let’s break down our queries, starting from our child table(Always use it as a derived table). We shall consider the grouping by user_id in the end.

mysql> select post_id, count(*) no_comments, sum(post_rating) total_rating from post_comments 
group by post_id;

+---------+-------------+--------------+
| post_id | no_comments | total_rating |
+---------+-------------+--------------+
|       1 |           2 |           18 |
|       2 |           2 |           11 |
|       3 |           1 |            8 |
|       4 |           1 |           10 |
+---------+-------------+--------------+

This gives us a row for each post, the no_comments and the total_rating. Now, we shall use the above query as our derived table.

mysql> select P.id post_id, P.user_id, P.no_words, d.no_comments, d.total_rating 
from posts P inner join 
  (
    select post_id, count(*) no_comments, sum(post_rating) total_rating 
    from post_comments group by post_id
  ) d 
on P.id = d.post_id;

+---------+---------+----------+-------------+--------------+
| post_id | user_id | no_words | no_comments | total_rating |
+---------+---------+----------+-------------+--------------+
|       1 |       1 |       27 |           2 |           18 |
|       2 |       1 |       46 |           2 |           11 |
|       3 |       2 |       58 |           1 |            8 |
|       4 |       1 |       27 |           1 |           10 |
+---------+---------+----------+-------------+--------------+
4 rows in set (0.00 sec)

 

This result has given us all the details for each post, note the virtual primary key here is the post id. Now, all that we have to do is, group by user_id and use aggregate functions to summarize the fields.

mysql> select P.user_id, count(*) no_posts, sum(no_words) no_words, 
sum(d.no_comments) no_comments, sum(d.total_rating) no_rating 
from posts P inner join 
  (
    select post_id, count(*) no_comments, sum(post_rating) total_rating 
    from post_comments group by post_id
   ) d 
on P.id = d.post_id group by P.user_id;

+---------+----------+----------+-------------+-----------+
| user_id | no_posts | no_words | no_comments | no_rating |
+---------+----------+----------+-------------+-----------+
|       1 |        3 |      100 |           5 |        39 |
|       2 |        1 |       58 |           1 |         8 |
+---------+----------+----------+-------------+-----------+

 

Let’s check our result: User 1 has:

3 posts – Correct

100 no_words – Correct

5 comments – Correct

39 rating – Correct

Whoi! We have finally got our correct result!

Next time you are using a join for a one – to -many relationship table, be careful while you group by and summarize columns using aggregate.

Notes:

1. The parent table will always have duplicate rows, and when you try aggregating the columns of the parent table, it will lead into very huge numbers.

2. Never use SUM(distinct column), except if your sure of what you are doing. It may lead to unexpected results.

Hope you enjoyed reading the post, as much as I enjoyed writing it. 🙂

Cheers,

Aishwarya 🙂 😀

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s