Achieve better performance when doing data segmentation in MySQL. - 19 Jan 2013
Recently, I had to work on improving real-time data segmentation. We had a solution in place but the performance were not so great. In some cases, getting a response back was taking a couple of minutes. In this post, I will share with how I managed to improve this.
Here is the scenario we will use to explore this problem.
You have a list of users and track their activities. You want to segment your list of users based on their activities dynamically.
Table Structure
Here is the table structure we will be using for this post.
Users
id | username | firstname | lastname |
1 | vincent.lamanna | Vincent | Lamanna |
2 | john.doe | John | Doe |
3 | jane.doe | Jane | Doe |
Activities
id | user_id | timestamp | activity |
1 | 1 | 2012-12-01 08:00:00 | signup |
2 | 1 | 2012-12-01 08:02:00 | confirm_signup |
3 | 1 | 2012-12-01 08:03:00 | login |
4 | 2 | 2012-12-03 12:54:00 | signup |
5 | 3 | 2012-12-05 15:23:00 | signup |
6 | 3 | 2012-12-05 15:24:00 | confirm_signup |
7 | 1 | 2012-12-05 18:04:00 | login |
A Common Naive Approach
Let’s say we want to segment our list of users to get a list of users who have signed up but haven’t login or confirm their signup yet. A common approach I’ve seen people using would be the following query.
SELECT `id`, `username`
FROM `users`
WHERE `id` IN (
SELECT DISTINCT(`user_id`)
FROM `activities`
WHERE `activity` = 'signup'
) AND (
`id` NOT IN (
SELECT DISTINCT(`user_id`)
FROM `activities`
WHERE `activity` = 'login'
)
OR `id` NOT IN (
SELECT DISTINCT(`user_id`)
FROM `activities`
WHERE `activity` = 'confirm_signup'
)
)
The issue with this approach is that it uses sub-queries and causes MySQL to performance a full table scan on the main table you are segmenting. Here is the MySQL Explain you would get running the above query.
id | select_type | table | type | possible_keys | key | key_len | ref | rows |
1 | PRIMARY | users | ALL | NULL | NULL | NULL | NULL | 3 |
4 | DEPENDENT SUBQUERY | activities | index_subquery | user_id | user_id | 4 | func | 1 |
3 | DEPENDENT SUBQUERY | activities | index_subquery | user_id | user_id | 4 | func | 1 |
2 | DEPENDENT SUBQUERY | activities | index_subquery | user_id | user_id | 4 | func | 1 |
When working with a small to medium dataset, this is usually not a problem if you have a decent server. However, it can become a real bottleneck when you have a large dataset. Simple segments could take minutes to return a result. It can really become a problem when you want to achieve real-time data segmentation.
A Better Approach
After some iterations, here is the approach I found was best to achieve good performance on solving this problem.
SELECT u.`id`, u.`username`
FROM `users` u
LEFT JOIN `activities` q1_1 ON u.`id` = q1_1.`user_id` AND q1_1.`activity` = 'signup'
LEFT OUTER JOIN `activities` q1_2 ON (u.`id` = q1_2.`user_id` AND q1_2.`activity` = 'signup' AND
(q1_1.`id` < q1_2.`id` OR q1_1.`id` = q1_2.`id` AND q1_1.`id` < q1_2.`id`))
LEFT JOIN `activities` q2_1 ON u.`id` = q2_1.`user_id` AND q2_1.`activity` = 'confirm_signup'
LEFT OUTER JOIN `activities` q2_2 ON (u.`id` = q2_2.`user_id` AND q2_2.`activity` = 'confirm_signup' AND
(q2_1.`id` < q2_2.`id` OR q2_1.`id` = q2_2.`id` AND q2_1.`id` < q2_2.`id`))
LEFT JOIN `activities` q3_1 ON u.`id` = q3_1.`user_id` AND q3_1.`activity` = 'login'
LEFT OUTER JOIN `activities` q3_2 ON (u.`id` = q3_2.`user_id` AND q3_2.`activity` = 'login' AND
(q3_1.`id` < q3_2.`id` OR q3_1.`id` = q3_2.`id` AND q3_1.`id` < q3_2.`id`))
WHERE q1_2.`id` IS NULL AND q2_2.`id` IS NULL AND q3_2.`id` IS NULL
AND (q1_1.`id` IS NOT NULL AND (q2_1.`id` IS NULL OR q3_1.`id` IS NULL))
Looking at the MySQL Explain, we can see this is a much better approach.
id | select_type | table | type | possible_keys | key | key_len | ref | rows |
1 | PRIMARY | u | ALL | NULL | NULL | NULL | NULL | 3 |
1 | SIMPLE | q1_1 | ref | user_id | user_id | 4 | application.u.id | 1 |
1 | SIMPLE | q1_2 | ref | user_id | user_id | 4 | application.q1_1.user_id | 1 |
1 | SIMPLE | q2_1 | ref | user_id | user_id | 4 | application.q1_1.user_id | 1 |
1 | SIMPLE | q2_2 | ref | user_id | user_id | 4 | application.q1_1.user_id | 1 |
1 | SIMPLE | q3_1 | ref | user_id | user_id | 4 | application.q1_1.user_id | 1 |
1 | SIMPLE | q3_2 | ref | user_id | user_id | 4 | application.q1_1.user_id | 1 |
Here is how it works. First, we need to build the dataset we will be using to segment. For this part, we use double JOIN statements along with a WHERE statement. This is to ensure that if a record as more than 1 activity of the same time it is not duplicating the user in the final table.
LEFT JOIN `activities` q1_1 ON u.`id` = q1_1.`user_id` AND q1_1.`activity` = 'signup'
LEFT OUTER JOIN `activities` q1_2 ON (u.`id` = q1_2.`user_id` AND q1_2.`activity` = 'signup' AND
(q1_1.`id` < q1_2.`id` OR q1_1.`id` = q1_2.`id` AND q1_1.`id` < q1_2.`id`))
...
WHERE q1_2.`id` IS NULL
Once we have the dataset, we can then apply the segmentation rule using a where statement.
WHERE (q1_1.`id` IS NOT NULL AND (q2_1.`id` IS NULL OR q3_1.`id` IS NULL))
The reason why we do this instead of simple INNER JOIN statement or LEFT OUTER JOIN statement is to allow for complex segmentation rule such as the following.
(Criteria 1 OR Criteria 2) AND (Criteria3 OR Criteria 4) OR Criteria 5
To conclude this post, MySQL is great for doing real-time data segmentation. We just need to make sure we do it the proper way.
comments powered by Disqus