Webserver Analytics

At the skankworks.net we use open-source software or develop our own. In this series of articles we will develop a LAMP application for real-time analytics of web-server usage statistics.

We do not use third-party services that would do this analysis for us, because that would require us to give out information about our visitors. Instead, we analyze our website the traditional way – by writing our own software. A few basic real-time statistics have been put up already as an example of where we are headed. The tables show the number of hits on popular pages of the site, both historically and recently, filtered to remove false positives such as access from bots and other non-user activity.

The tables can be refreshed with up-to-date information by clicking their titles. The “Most Popular” table reads from a pre-calculated table and is very fast, in comparison to the “Recent” table, which scans the entire logs filtering by date. The numbers come from the Apache logfiles, which are captured and piped into a MySQL database. We use a separate instance of MySQL running on a dedicated server and remote syslog to transfer the logging. On this host a script tails the logfile, processing and adding each line of logging to the database as it arrives. The front end consists of Ajax-driven PHP scripts permissioned to call certain stored procedures on the remote DB.

The software is currently in alpha, and thus lacks many front-end features, and the back-end has not been optimized for performance. A rough-and-ready guide to the set-up follows. In follow-up posts we shall continue development and optimisation of a real-time analytics application. We will not cover infrastructure, and thus assume that an operational LAMP website with remote logging is already in place.

Part 1 – Database Set-up

We assume throughout that Apache is set to the following log-file format:

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
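A line in this format looks like the following (a made-up request, shown only to illustrate the fields the parser has to split out):

192.0.2.1 - - [12/Mar/2012:10:15:32 +0000] "GET /index.php HTTP/1.1" 200 5043 "http://example.com/search" "Mozilla/5.0 (X11; Linux x86_64)"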

We recommend running the statistics database on a dedicated host so as to avoid imposing any load on the web-server. Where that is not possible, it can of course be set up on a single host alongside other databases.

To set up the database, log in to MySQL and create a database for the logfiles:

CREATE DATABASE apachelog;
USE apachelog;

Create the tables:
We define three tables, only two of which are currently used.

errors – unused
hits – the main table, where log entries are broken down into their components. It is a placeholder table for early development and will later be split into a set of normalised sub-tables related by foreign keys.
hit_count – populated by a trigger on inserts to hits, this table keeps an up-to-date tally of every url visited and the time of the last visit.

-- Database: `apachelog`

-- --------------------------------------------------------


-- Table structure for table `errors`

CREATE TABLE `errors` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`line` varchar(1024) NOT NULL,
UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------


-- Table structure for table `hits`

CREATE TABLE `hits` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`host` varchar(128) NOT NULL,
`field2` varchar(32) DEFAULT NULL,
`field3` varchar(32) DEFAULT NULL,
`timestr` varchar(32) NOT NULL,
`timediffstr` varchar(12) NOT NULL,
`command` varchar(128) NOT NULL,
`url` varchar(1024) NOT NULL,
`protocol` varchar(24) NOT NULL,
`field9` varchar(64) DEFAULT NULL,
`code` varchar(12) NOT NULL,
`bytes` varchar(12) NOT NULL,
`referrer` varchar(64) DEFAULT NULL,
`field13` varchar(64) DEFAULT NULL,
`agentstr` varchar(512) NOT NULL,
`field15` varchar(64) NOT NULL,
`field16` varchar(64) NOT NULL,
`field17` varchar(64) NOT NULL,
`field18` varchar(64) NOT NULL,
`field19` varchar(64) NOT NULL,
`field20` varchar(64) DEFAULT NULL,
`md5hash` varchar(33) NOT NULL,
`time` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_2` (`md5hash`,`host`,`timestr`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

-- Table structure for table `hit_count`

CREATE TABLE `hit_count` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`md5hash` varchar(33) NOT NULL,
`url` varchar(1024) NOT NULL,
`count` int(11) NOT NULL DEFAULT '1',
`last_time` int(11) NOT NULL DEFAULT '1',
UNIQUE KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

Create Triggers

DELIMITER //
CREATE TRIGGER `update_hit_count` AFTER INSERT ON `hits`
FOR EACH ROW
BEGIN
IF NOT EXISTS
(SELECT 1 FROM `hit_count` WHERE md5hash = NEW.md5hash)
THEN
INSERT INTO hit_count (md5hash, url, last_time) VALUES (NEW.md5hash, NEW.url, NEW.time);
ELSE
UPDATE hit_count
SET count = count + 1, last_time = NEW.time -- TODO: ensure NEW.time > last_time; if not, only increment count
WHERE md5hash = NEW.md5hash;
END IF;
END//
DELIMITER ;
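As a quick sanity check, the sketch below inserts two hits on the same page and confirms the trigger has tallied them. The host and url values are made up, we assume here that md5hash is simply the MD5 of the url, and the minimal column lists assume the server is not running in strict SQL mode (so the omitted NOT NULL varchar columns default to empty strings):

INSERT INTO hits (host, url, md5hash, time)
VALUES ('192.0.2.1', '/index.php', MD5('/index.php'), UNIX_TIMESTAMP());
INSERT INTO hits (host, url, md5hash, time)
VALUES ('192.0.2.2', '/index.php', MD5('/index.php'), UNIX_TIMESTAMP());

-- Expect a single row with count = 2:
SELECT url, count, last_time FROM hit_count WHERE md5hash = MD5('/index.php');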

Create the Stored Procedures

page_hits – returns the pre-calculated hit count for a given url
most_popular – selects the 25 urls with the most hits, ordered from highest to lowest
referrers – selects the top 25 referring urls, ignoring hits with no referer set or that come from internal links

DELIMITER //
CREATE PROCEDURE `page_hits`(
IN theUrl varchar(1024)
)
BEGIN
SELECT count FROM hit_count WHERE url = theUrl;
END//
DELIMITER ;

DELIMITER //
CREATE PROCEDURE `most_popular`()
BEGIN
SELECT url, count FROM hit_count ORDER BY count DESC LIMIT 25;
END//
DELIMITER ;

DELIMITER //
CREATE PROCEDURE `referrers`()
BEGIN
SELECT referrer, COUNT(referrer) AS number
FROM hits
WHERE referrer NOT LIKE '[your host]%' -- internal links
AND referrer != '-' -- no referer set
AND referrer != '/robots.txt'
GROUP BY referrer ORDER BY number DESC, referrer LIMIT 25;
END//
DELIMITER ;
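The read-side procedures can then be exercised from the mysql client; the url passed to page_hits is just an example:

CALL most_popular();
CALL referrers();
CALL page_hits('/index.php');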

-- insert into db
DELIMITER //
CREATE PROCEDURE add_log_entry
(
IN host varchar(128),
IN field2 varchar(32),
IN field3 varchar(32),
IN timestr varchar(32),
IN timediffstr varchar(12),
IN command varchar(128),
IN url varchar(1024),
IN protocol varchar(24),
IN field9 varchar(64),
IN code varchar(12),
IN bytes varchar(12),
IN referrer varchar(64),
IN field13 varchar(64),
IN agentstr varchar(512),
IN field15 varchar(64),
IN field16 varchar(64),
IN field17 varchar(64),
IN field18 varchar(64),
IN field19 varchar(64),
IN field20 varchar(64),
IN md5hash varchar(33),
IN time int(11))
BEGIN
INSERT INTO hits
(
host, field2, field3, timestr, timediffstr, command, url, protocol, field9, code, bytes, referrer, field13, agentstr, field15, field16, field17, field18, field19, field20, md5hash, time
)
VALUES
(
host, field2, field3, timestr, timediffstr, command, url, protocol, field9, code, bytes, referrer, field13, agentstr, field15, field16, field17, field18, field19, field20, md5hash, time
);
END//
DELIMITER ;
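For illustration, this is roughly the call the tailing script would issue for the sample log line shown earlier, once split into its 22 fields. Exactly how the parser divides %t between timestr and timediffstr is an implementation detail, and the MD5-of-url hash is our assumption:

CALL add_log_entry('192.0.2.1', '-', '-', '12/Mar/2012:10:15:32', '+0000',
'GET', '/index.php', 'HTTP/1.1', NULL, '200', '5043',
'http://example.com/search', NULL, 'Mozilla/5.0 (X11; Linux x86_64)',
'', '', '', '', '', NULL,
MD5('/index.php'), UNIX_TIMESTAMP());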

Create the users and grant privileges.

w_example – this is the maintenance user that will be given full privileges on the database. Its primary purpose is the loading of data. It will only be used by back-end scripts that cannot be accessed by the webserver.
r_example – a restricted user used by the scripts called via Ajax from the webserver to generate the statistics tables.

You should also set up a third user as the database owner with all privileges, to be used only for maintenance logins.

CREATE USER 'w_example'@'example.host' IDENTIFIED BY 'strong_password';
CREATE USER 'r_example'@'example.host' IDENTIFIED BY 'an_even_stronger_password';

-- read user
GRANT EXECUTE ON PROCEDURE apachelog.referrers TO 'r_example'@'example.host';
GRANT EXECUTE ON PROCEDURE apachelog.most_popular TO 'r_example'@'example.host';
GRANT EXECUTE ON PROCEDURE apachelog.page_hits TO 'r_example'@'example.host';
GRANT SELECT ON TABLE apachelog.hits TO 'r_example'@'example.host';

-- write user
GRANT EXECUTE ON PROCEDURE apachelog.add_log_entry TO 'w_example'@'example.host';
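To confirm that each account has ended up with exactly the rights listed above:

SHOW GRANTS FOR 'r_example'@'example.host';
SHOW GRANTS FOR 'w_example'@'example.host';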

Verification
Once loaded with data, the following two statements should always output equal values:

SELECT SUM(count) FROM hit_count;
SELECT count(*) FROM hits;

As should these:

SELECT COUNT(DISTINCT md5hash) FROM hits;
SELECT COUNT(*) FROM hit_count;
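The same invariants can be checked in a single pass with a query along these lines – a sketch for spot checks only, as it scans the whole hits table:

-- List any url whose pre-calculated tally disagrees with the raw row count
-- (should return no rows):
SELECT hc.url, hc.count AS tallied, COUNT(h.id) AS actual
FROM hit_count hc
LEFT JOIN hits h ON h.md5hash = hc.md5hash
GROUP BY hc.md5hash, hc.url, hc.count
HAVING tallied <> actual;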

Usage
Such a database can already be put to many uses, for example the hit count in the skankworks.net masthead above. These counts can be added very easily to any webpage or blog. The feature currently calls a PHP script, which in turn calls a stored procedure to retrieve the count each time a page is loaded. The look-up is cheap because the update_hit_count trigger pre-calculates the tally in a dedicated hit-count table every time a page is viewed. Later we will update this field in the browser with an Ajax function called when the page is loaded. This will allow very fast look-ups of key page metrics with minimal effect on page loading time.

In Part 2 we will present scripts that filter by date, response code, or host, scripts that populate the database with log entries from existing archive files, and real-time updates. Part 3 will cover the Ajax-enabled front-end development, and the fourth and final part will involve re-engineering for performance.

At the skankworks.net our development philosophy is one of continuous delivery, so we like to get working software published early in the SDLC. We plan to develop our analytics openly and welcome any feedback from the webmastering community. Our initial engineering focus will be on security, thus none of the early versions of the app will accept any user input other than mouse clicks to pre-defined parameterless procedures.

Our secondary focus will be on maintainability. Early versions of the application allow us to analyze where the bottlenecks occur and thus focus later efforts on performance. We design all parts of the software to be modular and replaceable to facilitate scalability and requirement changes. We do not use agile methods.

The code that will be developed in this series of posts remains copyright to the skankworks.net. It may be freely used or redistributed so long as it contains a link back to “www.skankworks.net”.

Feel free to use the comments to make suggestions.
