Tuesday, August 18, 2009

ORDER BY RAND Alternative

If you're reading this chances are you know everything you need to know about why using ORDER BY RAND() in MySQL is not such a good idea so I'll skip all of that and get right to the alternative.

Add a new column to your database table called rand_id and make it large enough to hold a UNIX timestamp. The index is extremely important. This table modification may take awhile if you have a large table.

ALTER TABLE my_table ADD `rand_id` INT NOT NULL ,
ADD INDEX(rand_id)

With the new column in place, setup a cron job which executes an SQL statement to update that column once a day (or more often if you wish). What this does is give each row a random ordering ID starting with the current time plus a random portion of the seconds in a day.

UPDATE my_table
SET rand_id = (UNIX_TIMESTAMP() + (RAND() * 86400))

To fetch a random row you use a query like this.

FROM my_table

Here's a few numbers from a test table with 1,572,851 rows. The server used isn't exactly optimized to be a database server either.

Selecting a random row using ORDER BY RAND() takes 37 seconds.
Executing the query to update the rand_id column takes 110 seconds.
Selecting a random row after the rand_id column is updated takes 0.0012 seconds.

Now if we needed a random row every second of every day, we would need 86,400 rows.

86,400 * 37 = 3,196,800 seconds spent selecting random rows. It would take more than a month to select a days worth or rows using ORDER BY RAND().

86,400 * 0.0012 = 103.68 seconds spent fetching random rows. If we add the 110 seconds needed to generate the rand_id column values that brings us to 213.68 seconds. That's less than 5 minutes spent fetching a days worth of random rows. Huge improvement over the ORDER BY RAND option.

Now, this solution can have a few serious drawbacks depending on your needs and table size.

The longer it has been since the rand_id column was generated, the more predictable the random row will be since you can make a note of the rows that have been selected throughout the day and narrow things down considerably.

Tables with less than roughly 90,000 rows are in danger of producing the same result multiple times in a row for seconds to minutes at a time.

No comments: