Real-Time Log Analysis

Apache Log Analysis in Real-Time

We’re not keen on external services that require the webmaster to embed third-party JavaScript into their pages, for a number of reasons, so we’re writing our own.

Here’s our latest progress update.

Capacity Planning

We have written a Java program on the backend that reads the Apache combined log using tail, parses each line, and inserts it into a database in fourth normal form. Our DB host is nothing special: a dual-core Intel with 8GB of RAM running MySQL. All servers run Ubuntu Linux.
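
For readers unfamiliar with the combined format, here is a minimal sketch of the kind of parsing involved. The regex and class name are illustrative only, not our production parser, which has to cope with malformed lines and format variations.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: splitting one line of the standard Apache combined log format
// into its fields. Our production parser is more defensive, but the idea is the same.
public class CombinedLineSketch {
	private static final Pattern COMBINED = Pattern.compile(
		"^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

	// Returns the matcher positioned on the line, or null if the line does not parse.
	// Groups: 1 host, 2 ident, 3 user, 4 timestamp, 5 request, 6 status, 7 bytes, 8 referer, 9 user-agent.
	public static Matcher parse(String line) {
		Matcher m = COMBINED.matcher(line);
		return m.matches() ? m : null;
	}
}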

Our backend is currently maxing out at around eight-thousand raw hits per second. To give a sense of scale, Wikipedia’s web servers receive an average of fifty-thousand hits per second. The typical commercial website would be lucky to receive eight-thousand hits per day.

We run the Java instance on an eight-core host, with three parsing threads to prepare the SQL and six inserter threads to insert data into the database. With more powerful hardware, such as a DB cluster of four eight-core servers, our application could easily handle Wikipedia’s traffic.

Here’s a screenshot from our latest run on the test infrastructure, taken while running in bulk mode directly from Eclipse during the loading of historical data. In real-time mode the GUI would not be displayed, as the application is designed to run headless (and the queues can be much shorter, as live web traffic is unlikely to fill queues as large as those shown here).

apache log file analysis loading

The figures are for the number of incoming log lines, not rows, inserted into the DB. Each line requires five tables to be updated while keeping foreign-key relations intact, so we are running our MySQL database on a dual-core host and getting a rate of forty-thousand rows updated per second.

The majority of updates are INSERTS, with about twenty-five-percent being UPDATES.

This is in addition to an equivalent number of selects taking place concurrently.

If anybody would like to suggest a method of increasing this performance, without using LOAD DATA INFILE, we'd like to hear it. Use the comment form below, or tweet us a link. 

We do not need anything like this level of performance in our own production environment, so we have introduced rate-limiters into the code in order to tune it to the level of performance we require and save on resources.

In production we limit the Java application to using no more than 5% of CPU cycles and 1GB of RAM, allowing it to run happily alongside other apps. If we want the same performance we get in our load tests, we can give the app more resources and raise the rate limits.
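
To illustrate the idea, here is a hypothetical sketch of a simple lines-per-second throttle. It is not our actual limiter (which also caps memory use), and the class name and quota handling are made up for the example.

// Hypothetical sketch: pause the reading thread once it has handled its quota for the current second.
public class LineThrottle {
	private final int maxLinesPerSecond;
	private long windowStart = System.currentTimeMillis();
	private int linesThisWindow = 0;

	public LineThrottle(int maxLinesPerSecond) {
		this.maxLinesPerSecond = maxLinesPerSecond;
	}

	// Call once per log line; sleeps out the remainder of the second when the quota is reached.
	public void onLine() throws InterruptedException {
		if (++linesThisWindow >= maxLinesPerSecond) {
			long elapsed = System.currentTimeMillis() - windowStart;
			if (elapsed < 1000) {
				Thread.sleep(1000 - elapsed);
			}
			windowStart = System.currentTimeMillis();
			linesThisWindow = 0;
		}
	}
}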

To achieve such high performance we have avoided the use of third-party Java libraries as much as possible and written everything ourselves, customized to the job at hand. We have implemented several internal caches, for example, to maximize performance, along with prepared statements, batch inserts, and queue pools for internal communication.
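
As a rough sketch of the prepared-statement batching, assuming a JDBC connection and an illustrative 'hits' table (the column names, HitRow type, rows collection, and batchSize variable are placeholders, not our actual code):

// Sketch only: accumulate rows into a JDBC batch and execute at the configured batch size.
try (PreparedStatement ps = connection.prepareStatement(
		"INSERT INTO hits (timeStamp, ipId, uriId, statusCode, bytesSent) VALUES (?, ?, ?, ?, ?)")) {
	int pending = 0;
	for (HitRow row : rows) {
		ps.setLong(1, row.timeStamp);
		ps.setInt(2, row.ipId);
		ps.setInt(3, row.uriId);
		ps.setInt(4, row.statusCode);
		ps.setLong(5, row.bytesSent);
		ps.addBatch();
		if (++pending == batchSize) {	// batchSize is one of the run-time tunables discussed below
			ps.executeBatch();
			pending = 0;
		}
	}
	if (pending > 0) {
		ps.executeBatch();	// flush the remainder
	}
}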

Aggregating Caches

Without the caching tricks we developed, the forty-thousand updates described above would have required one-hundred-thousand INSERT statements. Hence we could argue that our design is making the equivalent of 100k inserts per second.

But we cheated.

We implemented an aggregating cache that eliminates sixty-percent of them.

We settled on a cache based on an array of hashes, with one cache for each key-relation, after experimenting with various forms of linked lists. We periodically flush them to the database to prevent them growing too large. To discover what “too large” means, we tested the caching under various loads with precision timing.

We also use a couple of other tricks to maintain cache integrity and performance after flushing. It would not help if we had to rebuild the cache after every flush.
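
A much-simplified sketch of what one such cache might look like, for a single key-relation (user-agent strings are used purely as an example; the names and the IdAllocator hook are hypothetical):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an aggregating cache for one key-relation. Repeated values are
// resolved to their surrogate id from memory, and per-id hit counts are accumulated so that
// many would-be INSERT/UPDATE statements collapse into one row per flush.
public class AgentCacheSketch {
	public interface IdAllocator { int insertAndReturnId(String value); }

	private final Map<String, Integer> idByAgent = new HashMap<>();	// natural key -> surrogate id
	private final Map<Integer, Long> pendingHits = new HashMap<>();	// id -> hits not yet written

	public int idFor(String agent, IdAllocator db) {
		return idByAgent.computeIfAbsent(agent, db::insertAndReturnId);	// only new values touch the DB
	}

	public void countHit(int agentId) {
		pendingHits.merge(agentId, 1L, Long::sum);
	}

	// Hand the accumulated counts to the flusher; the id map survives, so there is no rebuild cost.
	public Map<Integer, Long> drainForFlush() {
		Map<Integer, Long> out = new HashMap<>(pendingHits);
		pendingHits.clear();
		return out;
	}
}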

We could gain a small performance boost by replacing each hash in the array with a linked list and implementing a strategy that keeps the most-accessed items at the head of the lists. But this would be a lot of effort for diminishing returns.

It’s something to consider though.

Load Testing

Extensive load testing at every stage of development is the real key to performance. If you leave the load testing until you have ‘working software’, the chances are you will never have working software. At least, not software that works well.

Performance problems that only show up after you go into production are likely to be the most expensive to fix, often requiring extensive refactoring.

To enable such testing, we created the following simple timer class and used it to measure everything to the millisecond under controlled conditions. It’s designed to be invoked around the innermost loops but can be used anywhere you wish to measure the execution time of a particular code block.

package apacheLog;

// Simple stopwatch: wraps System.currentTimeMillis() to time a block of code.
public class Timers {
	private long startMilli = 0L;
	private long endMilli = 0L;

	Timers() {
		this.reset();
	}
	
	public void start(){
		this.startMilli = System.currentTimeMillis();
	}
	
	public void stop(){
		this.endMilli = System.currentTimeMillis();
	}
	
	public long elapsedMillis(){
		if(endMilli < startMilli){
			return -1L;
		} else {
			return (this.endMilli - this.startMilli); 
		}
	}
	
	@Override
	public String toString(){
		StringBuilder out = new StringBuilder();
		out.append("Milli Start ");
		out.append(String.valueOf(this.startMilli));
		out.append(" Milli End ");
		out.append(String.valueOf(this.endMilli));
		return out.toString();
	}
	public void reset(){
		startMilli = 0L;
		endMilli = 0L;
	}

	public long getStartMilli() {
		return startMilli;
	}

	public void setStartMilli(long startMilli) {
		this.startMilli = startMilli;
	}

	public long getEndMilli() {
		return endMilli;
	}

	public void setEndMilli(long endMilli) {
		this.endMilli = endMilli;
	}
}
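
Usage is deliberately trivial; a minimal sketch (assuming the caller lives in the same apacheLog package, and with 'batch' standing in for whatever collection is being processed) looks something like this:

// Sketch: wrap the block you care about, then read back the elapsed time.
Timers insertTimer = new Timers();
insertTimer.start();
for (String line : batch) {
	// ... parse the line and add it to the current insert batch ...
}
insertTimer.stop();
System.out.println("Batch took " + insertTimer.elapsedMillis() + " ms");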

As the code is developed, we add several instances of this class to measure statistics over the lifetime of entire objects. Take this example from one of our 'Insertion' threads:

 400,000 hits processed in 540.187 seconds, 740 lines per second
 Thread execution time  : 540.187
 Time reading input     : 9.444
 Time in cache          : 113.410
 Time inserting         : 303.775

This revealed to us, for example, that the 'sweet spot' for a batch insert is at 1,000 statements and that we should flush the application's internal caches to the database every twenty-thousand lines.

These numbers come not just from the internal code design, but also from the hardware it is running on. So we made the batch and cache-size parameters, along with many others, configurable at run-time via an external properties file.

This allows the finished application to be fine-tuned for performance to match the local hardware capabilities or throughput requirements. The application administrator may experiment to find the best settings in any given circumstance.
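
A minimal sketch of what loading such a properties file looks like, with a made-up file name and made-up keys:

import java.io.FileInputStream;
import java.util.Properties;

// Sketch only: read run-time tunables from an external properties file.
public class TuningSketch {
	public static void main(String[] args) throws Exception {
		Properties props = new Properties();
		try (FileInputStream in = new FileInputStream("apachelog.properties")) {
			props.load(in);
		}
		int batchSize = Integer.parseInt(props.getProperty("insert.batchSize", "1000"));
		int flushLines = Integer.parseInt(props.getProperty("cache.flushEveryLines", "20000"));
		int queueCapacity = Integer.parseInt(props.getProperty("queue.capacity", "50000"));
		System.out.println(batchSize + " / " + flushLines + " / " + queueCapacity);
	}
}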

Architecture

The basic design pattern throughout is Producer-Consumer. This was less a choice than a consequence of the problem: the application is essentially an extract-transform-load (ETL) pipeline, and with the various components able to process at different rates, a queue-connected producer-consumer design suggests itself as the most efficient in the circumstances.

We came across this design the hard way, naturally, having first tried several other wireframe designs, each an improvement on the one before. This relates to the point made above, which deserves emphasis: start testing for performance as early in the SDLC as you can.

Once the design decision was made we added a feeder class to the front to allow for multiple producers, and producers may have multiple consumers, fanning out into a tree structure. We connect all parts with queues. Each connection has a minimum of two queues: one to carry the data payload, another for control and status messages.

The payload queues are pooled, allowing each object to run queues in parallel if desired. Every object (feeder, extractor, transformer, and loader, to use the ETL terminology) runs in its own thread.

The queue end-points are directly connected between threads by the launcher as it creates the objects, but they have been designed to be interchangeable. For example, a method can be added to connect the end-points with TCP sockets, allowing distributed processing.

The logical layout of the application is a Markov chain of Producer-Consumers. The number of threads (each representing a node in the chain) and the number and capacity of the data queues are all run-time configurable.
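
Stripped of the control queues, pooling, and error handling, one producer-consumer link in the chain reduces to something like the following sketch (the capacity, end-of-stream marker, and class names are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of one queue-connected producer-consumer link.
// In the real application each stage has its own class and a second queue for control messages.
public class PipelineSketch {
	public static void main(String[] args) {
		BlockingQueue<String> payload = new ArrayBlockingQueue<>(50_000);	// capacity is run-time configurable

		Thread parser = new Thread(() -> {
			try {
				for (int i = 0; i < 100; i++) {
					payload.put("parsed line " + i);	// blocks if the consumer falls behind
				}
				payload.put("EOF");	// simple end-of-stream marker for the sketch
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt();
			}
		});

		Thread inserter = new Thread(() -> {
			try {
				String item;
				while (!(item = payload.take()).equals("EOF")) {
					// ... batch up and insert into the database ...
				}
			} catch (InterruptedException e) {
				Thread.currentThread().interrupt();
			}
		});

		parser.start();
		inserter.start();
	}
}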

Objectives and Extensions

The design objectives are high performance and scalability, and each component is developed and tested with this in mind. By modularising, we are also able to swap out components. We may, for example, develop a transformer class to handle firewall logs, instead of Apache logs, while keeping the overall design unchanged.

We are developing a bulk-loading insertion class that will use the aforementioned LOAD DATA INFILE instead of batch inserts for a five- or ten-fold performance increase when pre-loading historical data in a de novo installation.

This method might not be appropriate for real-time usage though, due to complications with the foreign-key relations. We may find that we spend more CPU time duplicating foreign-key management outside of the database than we save on the file loads. Partitioning would perhaps scale better than file loading, but this remains to be measured and tested.
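
For the bulk pre-load case the call itself is straightforward; a hypothetical sketch through JDBC might look like the following (file path, table, and column list are placeholders, and the MySQL connection must permit local infile):

// Hypothetical sketch: bulk-load a tab-separated dump instead of issuing batched INSERTs.
try (Statement st = connection.createStatement()) {
	st.execute("LOAD DATA LOCAL INFILE '/tmp/hits.tsv' INTO TABLE hits " +
			"FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' " +
			"(timeStamp, ipId, uriId, statusCode, bytesSent)");
}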

We will add visual design documents here later.

Schema

Fundamental to any high-performing application is an efficient datastore.

We refined the schema as we developed the code: performance enhancements in the code sometimes suggested DB changes, and integrity requirements in the database fed back as design changes in the software.

Here is our current schema, showing defined relations.

apache log db schema

Not all of the fields shown are in use, and many are there to allow for future updates. Large data objects, such as strings which are frequently duplicated, are moved to separate tables and a relation established.

By removing duplicates into separate tables we keep the size of the tables down to a minimum, allowing many of them to fit into memory for super-fast selects. The price to be paid is that it is more difficult and time-consuming to update the tables. Hence the performance requirements on the back-end software.
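
On the database side, that cost shows up as a lookup-or-insert for every deduplicated string; a simplified sketch, with illustrative table and column names, is shown below.

// Sketch only: resolve a duplicated string (here a referer) to the id stored in the hits table.
// The in-memory caches described earlier mean this path is only taken for values not yet seen.
private int refererId(Connection con, String referer) throws SQLException {
	try (PreparedStatement find = con.prepareStatement("SELECT id FROM referers WHERE referer = ?")) {
		find.setString(1, referer);
		try (ResultSet rs = find.executeQuery()) {
			if (rs.next()) return rs.getInt(1);
		}
	}
	try (PreparedStatement ins = con.prepareStatement(
			"INSERT INTO referers (referer) VALUES (?)", Statement.RETURN_GENERATED_KEYS)) {
		ins.setString(1, referer);
		ins.executeUpdate();
		try (ResultSet keys = ins.getGeneratedKeys()) {
			keys.next();
			return keys.getInt(1);	// becomes the foreign key in the hits table
		}
	}
}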

Some of the smaller fields are still to be taken out of the 'hits' table and replaced with relations. The final goal is to have the main 'hits' table contain nothing but integer values and key relations. Those still left to normalise are too small to have much of a performance impact, but if the DB is to scale into billions of rows, it would become noticeable.

Some of the tables are not currently used. Some, such as 'dailyHits' are populated by stored procedures. Also not shown, and crucial to good read performance, is the indexing strategy, which is based on performance testing of the front-end, rather than the back-end.

Front-End

We use Tomcat with a Single Page Application (SPA) for the front-end, calling updates with Ajax. Its design will be the subject of a separate article, since it has to deal with non-performance issues such as session handling, maintaining state, and supporting deep bookmarking.

Here are a couple of teasers in screenshot form to give some ideas of what the database can be used for:

Hits Per Minute
Chart shows total hits on the website per minute over the last few hours. The query to generate the data takes a few milliseconds to run and the chart can scroll in real-time.
hits sample 1
We use flot for the charts.

Latest Hits
Formatted list view of the latest hits, with click-through to detailed views.
hits sample 2
(IP addresses removed)

Hits with Bots
We also filter out bot traffic. The following chart shows total traffic and bot hits over a longer time period on a test server. From a testing point of view, if we ever see the bot traffic go above the total traffic we know we have a bug somewhere.
hits3

The front-end also provides many click-through lists for when a deep-dive into the data is required.

The page hit-counter shown on all pages at skankworks.net is also drawn from the same database.

Further updates are planned.

Check back again for progress. We hope to have a demo online soon, but this will require us to obfuscate our demo database to remove real-world IP addresses, User Agents, and other info that could be used to identify a person.

Third Party Analysis

That last point is why we disdain third-party analytics. They require us to share info about our visitors with the third party, without the visitors' knowledge or consent, and beyond our control. This violates our own privacy policy, so we'd rather write our own code than piss on our users by changing our policy.

Conclusions

We had originally written a basic logfile database using PHP with MySQL. This gave us an insertion rate of around twenty-five per second. Scanning through websites such as Stack Overflow, we soon discovered that this is typical of the performance to be expected from hastily thrown-together scripts.

We were determined not just to do better, but to do the best, inspired in part by Percona on the MySQL side and by the high-performance Java applications we've worked on for our corporate clients.

We are reasonably satisfied with the four-thousand-fold performance gain we have obtained thus far. From twenty-five inserts per second to the equivalent of one-hundred-thousand, and with a much better schema underlying the code.

What we have demonstrated with our logfile database is that Java and MySQL together offer the best performance money can buy. Or they would, if you had to pay money for them. They are free, and you can spend the money you save on hiring developers and architects who really know how to use them.

Bearing that in mind, and if you'd like some advice on how to achieve similar performance from your own Java/MySQL applications, do not forget that we are for hire. 😉