Common use cases for the MySQL Join statement

Demystifying the MySQL Join statement for building better datasets. - 29 Dec 2012

Since I started working as a software engineer, I’ve seen a lot of people misusing or trying to avoid MySQL Join statement. These statements can actually be quite handy when used properly.

In this post, I will demonstrate how to effectively use the Join statement with some common use cases. The scenario will be the following:

You have a list of users and track their activities. You want to build an internal reporting tool.

Table Structure

Here is the table structure we will be using for these common use cases.

Users

id username firstname lastname
1 vincent.lamanna Vincent Lamanna
2 john.doe John Doe
3 jane.doe Jane Doe

Activities

id user_id timestamp activity
1 1 2012-12-01 08:00:00 signup
2 1 2012-12-01 08:02:00 confirm_signup
3 1 2012-12-01 08:03:00 login
4 2 2012-12-03 12:54:00 signup
5 3 2012-12-05 15:23:00 signup
6 3 2012-12-05 15:24:00 confirm_signup
7 1 2012-12-05 18:04:00 login
8 3 2012-12-05 19:45:00 login

Use Case #1: Getting the list of users with their signup date

For this use case, all we have to do is select everything from the users table and use a left join with the activities table. Using a left join will ensure that if some users don’t have a signup activity they will still show up in the list with a null value for the signup date.

The most important thing when doing a join is to ensure that we have indexes on the column we use in the join. In this scenario, we have a primary index on users.id and an index on activity.user_id.

Here’s the query:

SELECT u.`id`, u.`username`, a.`timestamp` as `signup_date`
FROM `users` u
LEFT JOIN `activities` a
	ON u.`id` = a.`user_id`
	AND a.`activity` = 'signup'

Here’s the results set:

id username signup_date
1 vincent.lamanna 2012-12-01 08:00:00
2 john.doe 2012-12-03 12:54:00
3 jane.doe 2012-12-05 15:23:00

Here’s the explain:

id select_type table type possible_keys key key_len ref rows
1 SIMPLE u ALL NULL NULL NULL NULL 3
1 SIMPLE a ref user_id user_id 4 application.u.id 1

As you can see, our join used the index we defined. Having properly set indexes ensure optimal performance.

Use Case #2: Getting the list of users with their signup and confirmation date

This use case is very similar to our first use case. All we will have to add is a second left join to the same activities table but filtering the activity to be confirm_signup.

Here’s the query:

SELECT u.`id`, u.`username`, a1.`timestamp` as `signup_date`, a2.`timestamp` as `confirm_date`
FROM `users` u
LEFT JOIN `activities` a1
	ON u.`id` = a1.`user_id`
	AND a1.`activity` = 'signup'
LEFT JOIN `activities` a2
	ON u.`id` = a2.`user_id`
	AND a2.`activity` = 'confirm_signup'

Here’s the results set:

id username signup_date confirm_date
1 vincent.lamanna 2012-12-01 08:00:00 2012-12-01 08:02:00
2 john.doe 2012-12-03 12:54:00 NULL
3 jane.doe 2012-12-05 15:23:00 2012-12-05 15:24:00

Here’s the explain:

id select_type table type possible_keys key key_len ref rows
1 SIMPLE u ALL NULL NULL NULL NULL 3
1 SIMPLE a1 ref user_id user_id 4 application.u.id 1
1 SIMPLE a2 ref user_id user_id 4 application.u.id 1

This use case shows us how useful the left join can be. We can easily generate a report showing us the users who signed up but who haven’t confirmed their sign up.

Use Case #3: Getting the list of users with the timestamp and type of their last activity

This use case is a bit trickier. It’s also referenced as the greatest-n-per-group problem. In our scenario, it can be solved by joining the activities table twice. The first join is a normal left join but the second join is a left outer join. This will have the effect of discarding all the rows except the latest one for each user.

Here’s the query:

SELECT u.`id`, u.`username`, a1.`timestamp`, a1.`activity`
FROM `users` u
LEFT JOIN `activities` a1 ON u.`id` = a1.`user_id`
LEFT OUTER JOIN `activities` a2 ON (u.`id` = a2.`user_id` AND
	(a1.`timestamp` < a2.`timestamp` OR a1.`timestamp` = a2.`timestamp` AND a1.`id` < a2.`id`))
WHERE a2.`id` IS NULL

Here’s the results set:

id username timestamp activity
1 vincent.lamanna 2012-12-05 18:04:00 login
2 john.doe 2012-12-03 12:54:00 signup
3 jane.doe 2012-12-05 19:45:00 login

Here’s the explain:

id select_type table type possible_keys key key_len ref rows
1 SIMPLE u ALL NULL NULL NULL NULL 3
1 SIMPLE a1 ref user_id user_id 4 application.u.id 1
1 SIMPLE a2 ref PRIMARY,user_id,timestamp user_id 4 application.u.id 1

This double join can be very handy when generating reports. An interesting thing to note as well, is that the index that is being used on the second join is still the user_id.

Use Case #4: Getting the list of users with their total number of activities

This last use case is very simple and can be achieved using only one left join. You will also need to use the count aggregator as well as a group by to have a list of users.

Here’s the query:

SELECT u.`id`, u.`username`, COUNT(a.`id`) as `num_activities`
FROM `users` u
LEFT JOIN `activities` a
	ON u.`id` = a.`user_id`
GROUP BY u.`id`

Here’s the results set:

id username num_activities
1 vincent.lamanna 4
2 john.doe 1
3 jane.doe 3

Here’s the explain:

id select_type table type possible_keys key key_len ref rows
1 SIMPLE u index NULL PRIMARY NULL NULL 3
1 SIMPLE a ref user_id user_id 4 application.u.id 1

As you can see in this last use case, the explain is a bit different from the other use cases for the users table. It’s actually because of the group by which causes MySQL to use the primary key.

Creating a report with all of the above use cases

Now is the time to sum everything up into one report.

Here’s the query:

SELECT u.`id`, u.`username`,
	a1.`timestamp` as `signup_date`, a2.`timestamp` as `confirm_date`,
	a3.`timestamp` as `last_date`, a3.`activity` as `last_activity`,
	COUNT(a5.`id`) as `num_activities`
FROM `users` u
LEFT JOIN `activities` a1
	ON u.`id` = a1.`user_id`
	AND a1.`activity` = 'signup'
LEFT JOIN `activities` a2
	ON u.`id` = a2.`user_id`
	AND a2.`activity` = 'confirm_signup'
LEFT JOIN `activities` a3 ON u.`id` = a3.`user_id`
LEFT OUTER JOIN `activities` a4 ON (u.`id` = a4.`user_id` AND
	(a3.`timestamp` < a4.`timestamp` OR a3.`timestamp` = a4.`timestamp` AND a3.`id` < a4.`id`))
LEFT JOIN `activities` a5
	ON u.`id` = a5.`user_id`
WHERE a4.`id` IS NULL
GROUP BY u.`id`

Here’s the results set:

id username signup_date confirm_date last_date last_activity num_activities
1 vincent.lamanna 2012-12-01 08:00:00 2012-12-01 08:02:00 2012-12-05 18:04:00 login 4
2 john.doe 2012-12-03 12:54:00 NULL 2012-12-03 12:54:00 signup 1
3 jane.doe 2012-12-05 15:23:00 2012-12-05 15:24:00 2012-12-05 19:45:00 login 3

To conclude this post, don’t be afraid of using joins in MySQL but use them wisely and make sure your indexes are properly set by running explains on your queries.

MySQL
MySQL , Join
comments powered by Disqus