Real time data segmentation

Achieve better performance when doing data segmentation in MySQL. - 19 Jan 2013

Recently, I had to work on improving real-time data segmentation. We had a solution in place but the performance were not so great. In some cases, getting a response back was taking a couple of minutes. In this post, I will share with how I managed to improve this.

Here is the scenario we will use to explore this problem.

You have a list of users and track their activities. You want to segment your list of users based on their activities dynamically.

Table Structure

Here is the table structure we will be using for this post.

Users

id username firstname lastname
1 vincent.lamanna Vincent Lamanna
2 john.doe John Doe
3 jane.doe Jane Doe

Activities

id user_id timestamp activity
1 1 2012-12-01 08:00:00 signup
2 1 2012-12-01 08:02:00 confirm_signup
3 1 2012-12-01 08:03:00 login
4 2 2012-12-03 12:54:00 signup
5 3 2012-12-05 15:23:00 signup
6 3 2012-12-05 15:24:00 confirm_signup
7 1 2012-12-05 18:04:00 login

A Common Naive Approach

Let’s say we want to segment our list of users to get a list of users who have signed up but haven’t login or confirm their signup yet. A common approach I’ve seen people using would be the following query.

SELECT `id`, `username`
FROM `users`
WHERE `id` IN (
	SELECT DISTINCT(`user_id`)
	FROM `activities`
	WHERE `activity` = 'signup'
) AND (
	`id` NOT IN (
		SELECT DISTINCT(`user_id`)
		FROM `activities`
		WHERE `activity` = 'login'
	)
	OR `id` NOT IN (
		SELECT DISTINCT(`user_id`)
		FROM `activities`
		WHERE `activity` = 'confirm_signup'
	)
)

The issue with this approach is that it uses sub-queries and causes MySQL to performance a full table scan on the main table you are segmenting. Here is the MySQL Explain you would get running the above query.

id select_type table type possible_keys key key_len ref rows
1 PRIMARY users ALL NULL NULL NULL NULL 3
4 DEPENDENT SUBQUERY activities index_subquery user_id user_id 4 func 1
3 DEPENDENT SUBQUERY activities index_subquery user_id user_id 4 func 1
2 DEPENDENT SUBQUERY activities index_subquery user_id user_id 4 func 1

When working with a small to medium dataset, this is usually not a problem if you have a decent server. However, it can become a real bottleneck when you have a large dataset. Simple segments could take minutes to return a result. It can really become a problem when you want to achieve real-time data segmentation.

A Better Approach

After some iterations, here is the approach I found was best to achieve good performance on solving this problem.

SELECT u.`id`, u.`username`
FROM `users` u
LEFT JOIN `activities` q1_1 ON u.`id` = q1_1.`user_id` AND q1_1.`activity` = 'signup'
LEFT OUTER JOIN `activities` q1_2 ON (u.`id` = q1_2.`user_id` AND q1_2.`activity` = 'signup' AND
	(q1_1.`id` < q1_2.`id` OR q1_1.`id` = q1_2.`id` AND q1_1.`id` < q1_2.`id`))
LEFT JOIN `activities` q2_1 ON u.`id` = q2_1.`user_id` AND q2_1.`activity` = 'confirm_signup'
LEFT OUTER JOIN `activities` q2_2 ON (u.`id` = q2_2.`user_id` AND q2_2.`activity` = 'confirm_signup' AND
	(q2_1.`id` < q2_2.`id` OR q2_1.`id` = q2_2.`id` AND q2_1.`id` < q2_2.`id`))
LEFT JOIN `activities` q3_1 ON u.`id` = q3_1.`user_id` AND q3_1.`activity` = 'login'
LEFT OUTER JOIN `activities` q3_2 ON (u.`id` = q3_2.`user_id` AND q3_2.`activity` = 'login' AND
	(q3_1.`id` < q3_2.`id` OR q3_1.`id` = q3_2.`id` AND q3_1.`id` < q3_2.`id`))
WHERE q1_2.`id` IS NULL AND q2_2.`id` IS NULL AND q3_2.`id` IS NULL
AND (q1_1.`id` IS NOT NULL AND (q2_1.`id` IS NULL OR q3_1.`id` IS NULL))

Looking at the MySQL Explain, we can see this is a much better approach.

id select_type table type possible_keys key key_len ref rows
1 PRIMARY u ALL NULL NULL NULL NULL 3
1 SIMPLE q1_1 ref user_id user_id 4 application.u.id 1
1 SIMPLE q1_2 ref user_id user_id 4 application.q1_1.user_id 1
1 SIMPLE q2_1 ref user_id user_id 4 application.q1_1.user_id 1
1 SIMPLE q2_2 ref user_id user_id 4 application.q1_1.user_id 1
1 SIMPLE q3_1 ref user_id user_id 4 application.q1_1.user_id 1
1 SIMPLE q3_2 ref user_id user_id 4 application.q1_1.user_id 1

Here is how it works. First, we need to build the dataset we will be using to segment. For this part, we use double JOIN statements along with a WHERE statement. This is to ensure that if a record as more than 1 activity of the same time it is not duplicating the user in the final table.

LEFT JOIN `activities` q1_1 ON u.`id` = q1_1.`user_id` AND q1_1.`activity` = 'signup'
LEFT OUTER JOIN `activities` q1_2 ON (u.`id` = q1_2.`user_id` AND q1_2.`activity` = 'signup' AND
	(q1_1.`id` < q1_2.`id` OR q1_1.`id` = q1_2.`id` AND q1_1.`id` < q1_2.`id`))
...
WHERE q1_2.`id` IS NULL

Once we have the dataset, we can then apply the segmentation rule using a where statement.

WHERE (q1_1.`id` IS NOT NULL AND (q2_1.`id` IS NULL OR q3_1.`id` IS NULL))

The reason why we do this instead of simple INNER JOIN statement or LEFT OUTER JOIN statement is to allow for complex segmentation rule such as the following.

(Criteria 1 OR Criteria 2) AND (Criteria3 OR Criteria 4) OR Criteria 5

To conclude this post, MySQL is great for doing real-time data segmentation. We just need to make sure we do it the proper way.

MySQL
MySQL , Join , Segmentation
comments powered by Disqus