Getting statistics by period using mysql

One of the most common client request is to display statistics per day/week/month or other periods of time. This statistics might mean clicks, article views, user registered etc.

Let’s see how can we display statistics of article views by day, week and month using php and mysql.

1) The quick solution

Record each view in a table and display the statistics when needed:

CREATE TABLE `article_views` (
`id` INT NOT NULL AUTO_INCREMENT ,
`article_id` INT NOT NULL ,
`view_date` DATETIME NOT NULL,
PRIMARY KEY ( `id` ) ,
INDEX ( `article_id` )
) ENGINE = MYISAM

On the article page will use the following query:

INSERT INTO `article_views` (
`article_id` ,
`view_date`
)
VALUES (
'1', NOW( )
);

To display the statistics we will use:

For article views by day

SELECT COUNT(*) AS `views` , `article_id`
FROM `article_views`
WHERE `view_date`
BETWEEN NOW( ) - INTERVAL 1
DAY AND NOW( )
GROUP BY `article_id`
ORDER BY `views` DESC
LIMIT 10

For article views by week:

SELECT COUNT(*) AS `views` , `article_id`
FROM `article_views`
WHERE `view_date`
BETWEEN NOW( ) - INTERVAL 1
WEEK AND NOW( )
GROUP BY `article_id`
ORDER BY `views` DESC
LIMIT 10

For article views by month:

SELECT COUNT(*) AS `views` , `article_id`
FROM `article_views`
WHERE `view_date`
BETWEEN NOW( ) - INTERVAL 1
MONTH AND NOW( )
GROUP BY `article_id`
ORDER BY `views` DESC
LIMIT 10

The problem with this solution, is that as your site grows, the article_views table will also grows, making these queries on each statistics display is a serious performance hog. So we need to find another solution, to improve the performance.

2) Using another table to cache the article views

If you don’t need the statistics to be displayed “real time” you can set up a cron script to keep informations about article views:

CREATE TABLE `article_views_by_period` (
`id` INT NOT NULL AUTO_INCREMENT ,
`article_id` INT NOT NULL ,
`day_views` INT NOT NULL ,
`week_views` INT NOT NULL ,
`month_views` INT NOT NULL ,
PRIMARY KEY ( `id` ) ,
UNIQUE KEY ( `article_id` )
) ENGINE = MYISAM

article_id is declared unique so we can perform an INSERT…ON DUPLICATE KEY UPDATE query.

Insert a new article view is done like in the previous case (Replace 252 with your article_id):

INSERT INTO `article_views` (`article_id` ,`view_date`) VALUES (252, NOW( ));

Then, we make a cron to run the query:

TRUNCATE TABLE `article_views_by_period`;
INSERT INTO `article_views_by_period` (`article_id`,`day_views`,`week_views`,`month_views`)
SELECT `article_id` , SUM( IF( `view_date`
BETWEEN NOW( ) - INTERVAL 1
DAY AND NOW( ) , 1, 0 ) ) AS day_views, SUM( IF( `view_date`
BETWEEN NOW( ) - INTERVAL 7
DAY AND NOW( ) , 1, 0 ) ) AS week_views, SUM( IF( `view_date`
BETWEEN NOW( ) - INTERVAL 1
MONTH AND NOW( ) , 1, 0 ) ) AS month_views
FROM `article_views`
WHERE `view_date` BETWEEN NOW( ) - INTERVAL 1 MONTH AND NOW( )
GROUP BY `article_id`
ON DUPLICATE KEY
UPDATE `article_views_by_period`.`day_views`=VALUES(day_views),`article_views_by_period`.`week_views`=VALUES(week_views),`article_views_by_period`.`month_views`=VALUES(month_views);

(notice how we used SELECT SUM(IF…)) instead of SELECT COUNT(*), that way we only have to traverse the table only once)

The interval between cron runs, depend on how accurate you need your data to be.

To display the statistics you just need to run:

SELECT * FROM article_views_by_period ORDER BY month_views DESC LIMIT 10;
SELECT * FROM article_views_by_period ORDER BY week_views DESC LIMIT 10;
SELECT * FROM article_views_by_period ORDER BY day_views DESC LIMIT 10;

3) Optimizing some more

Although the presented solution, does save some database usage, the problem of too many records in the article_views table still remains, and your database might crash when the cron is runned.

So we will try to find a way to decrease the number of rows in the article_views table, by grouping the views together when inserting a record.

To do so, we add 2 more fields to the article_views table:

ALTER TABLE `article_views` ADD `period_check` INT UNSIGNED NOT NULL AFTER `view_date` ,
ADD `views` INT NOT NULL AFTER `period_check` ;

and create an unique index on  multiple columns (article_id and period_check).

ALTER TABLE `article_views` ADD UNIQUE (
`article_id` ,
`period_check`
);

Period_check field will holds information about the period when the view was added in the database.

For example, if we decide to run the cron, every hour, we could store the time of the hour when an article was added :

INSERT INTO `article_views` (article_id,view_date,period_check,views) VALUES (252, NOW(), FLOOR((NOW()+0)/10000),1 ) ON DUPLICATE KEY UPDATE views = views+1

When, another view for the same article is recorded in the next hour, the update part of the query is executed, and the views field will be incremented instead of adding another record into the database

If you want the period checking to be done by minute you could just replace:

FLOOR ((NOW() +0 ) /10000)

with:

FLOOR ((NOW() +0 ) /100)

The cron script will now run the following query:

TRUNCATE TABLE `article_views_by_period`;
INSERT INTO `article_views_by_period` (`article_id`,`day_views`,`week_views`,`month_views`)
SELECT `article_id` , SUM( IF( `view_date`
BETWEEN NOW( ) - INTERVAL 1
DAY AND NOW( ) , `views`, 0 ) ) AS day_views, SUM( IF( `view_date`
BETWEEN NOW( ) - INTERVAL 7
DAY AND NOW( ) , `views`, 0 ) ) AS week_views, SUM( IF( `view_date`
BETWEEN NOW( ) - INTERVAL 1
MONTH AND NOW( ) , `views`, 0 ) ) AS month_views
FROM `article_views`
WHERE `view_date` BETWEEN NOW( ) - INTERVAL 1 MONTH AND NOW( )
GROUP BY `article_id`
ON DUPLICATE KEY
UPDATE `article_views_by_period`.`day_views`=VALUES(day_views),`article_views_by_period`.`week_views`=VALUES(week_views),`article_views_by_period`.`month_views`=VALUES(month_views);

Displaying the statistics is made by quering the article_views_by_period cache table:

SELECT * FROM article_views_by_period ORDER BY month_views DESC LIMIT 10;
SELECT * FROM article_views_by_period ORDER BY week_views DESC LIMIT 10;
SELECT * FROM article_views_by_period ORDER BY day_views DESC LIMIT 10;

The advantage of this solution is that will reduce the number of rows stored in article_Views table. The disadvantage is: you cannot store other data specific to a visit (like ip of the visitors, or the referer page).

In conclusion, a statistics script in PHP/Mysql is a tradeoff between performance and accuracy. If you don’t expect too many visitors, but you want real-time statistics you could use the first solution. If, however you have big traffic in your site, i recommend using the optimized solutions.

Related posts:

  1. 10 things you should know when switching from Mysql to PostgreSQL

Leave a Reply