Java: Caching In On Performance

In this article we follow up on our real-time log analysis article with a deep dive into the caching strategies we used. Read on to see how we improved our DB insert performance.

First Version, Running Without Any Caching

Our code analyzes Apache logfiles in real-time and inserts each line into a normalized MySQL database. The main table for each hit is shown below:

[Image: table_hits]

As can be seen from the foreign-key relationships, this table has been normalized. We do not want to keep large strings or repeated values in the main table. An example sub-table holds the ‘referrer’ string from the logfile:

[Image: table_refs]

The referrer is one of the normalizable fields that we examine in incoming log entries. It’s a text string (which we truncate at 2k) and will be frequently duplicated in incoming hits. So we move it to a sub-table and reference it with a foreign key.

The reference from hits is through the ref_id key. When we encounter a new referrer we add it to the referrers table and insert ref_id, the foreign-key value, into the hits table. We will use the referrers table as an example in this article, as the same strategy applies to each of the four sub-tables that hold normalized data.

We’ve also added an optimization with the count field. End-users will want to know how many times a particular referring page has sent a visitor to the website under analysis, and this will be a frequent query. So we have added the ‘count‘ field to allow this to be looked up directly without calculation. The calculation is performed when data is inserted into the DB, not when it is looked up.

The first entry of a new referrer will have the count set to 1, and each duplicate seen thereafter will increment this count. This provides the fastest possible look-up, and we do the same with visiting hosts, URLs visited, and user agents.

Trading Performance, Front-End vs Back-End

The count field provides enormous benefits to front-end performance, but it has a large impact on the back-end. This is a typical trade-off encountered in tiered applications. With four sub-tables containing counters, each log file entry requires five tables to be updated.

New entries will always require an INSERT into the hits table, but the four sub-tables will require an UPDATE to the count field far more often than the insertion of a new row. Implementing this in code requires that, for each new line, a SELECT is performed to see if the value already exists in the sub-table, followed by either an INSERT (for new rows) or an UPDATE to increment the counter.
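
As a rough illustration, the per-line logic looks something like the following JDBC sketch. It assumes an open Connection and the referrers(ref_id, referrer, count) table described above; the class and method names are illustrative rather than our actual code.

import java.sql.*;

// Sketch of the uncached per-line logic: SELECT to see if the referrer
// exists, then either UPDATE its counter or INSERT it with count = 1.
class UncachedReferrerStep {
    static long upsertReferrer(Connection conn, String referrer) throws SQLException {
        try (PreparedStatement select = conn.prepareStatement(
                "SELECT ref_id FROM referrers WHERE referrer = ?")) {
            select.setString(1, referrer);
            try (ResultSet rs = select.executeQuery()) {
                if (rs.next()) {
                    long refId = rs.getLong(1);
                    try (PreparedStatement update = conn.prepareStatement(
                            "UPDATE referrers SET count = count + 1 WHERE ref_id = ?")) {
                        update.setLong(1, refId);
                        update.executeUpdate();
                    }
                    return refId;   // existing referrer: counter bumped
                }
            }
        }
        // New referrer: INSERT with count = 1 and return the generated ref_id.
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO referrers (referrer, count) VALUES (?, 1)",
                Statement.RETURN_GENERATED_KEYS)) {
            insert.setString(1, referrer);
            insert.executeUpdate();
            try (ResultSet keys = insert.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);
            }
        }
    }
}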

This is slow, but accurate. On our test server we were able to get around 250 log entries per second entered into the database. This is fine for a single instance of a web server, but it could be a lot better.

Optimizing SQL, Adding Caches

The next version of the application code used the MySQL feature INSERT ... ON DUPLICATE KEY UPDATE. This replicates the functionality above in a single statement, eliminating the SELECT.
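
A sketch of the same step using this feature, assuming a UNIQUE index on the referrer column so the duplicate-key path can fire:

import java.sql.*;

// Sketch: one statement replaces the SELECT-then-INSERT/UPDATE pair.
// Assumes a UNIQUE index on referrers.referrer.
class OnDuplicateKeyStep {
    static void upsertReferrer(Connection conn, String referrer) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO referrers (referrer, count) VALUES (?, 1) "
                + "ON DUPLICATE KEY UPDATE count = count + 1")) {
            ps.setString(1, referrer);
            ps.executeUpdate();
        }
    }
}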

This is slightly more efficient, but it still requires all four sub-tables to be updated for each incoming row in hits. The feature is MySQL-specific and non-portable, and it also has a known quirk: auto-increment values are consumed even when the UPDATE path is taken, leaving gaps in the sequence.

Although minor, in our case it causes a problem. We need to know the ref_id, which is the auto-increment value, and because of this fragmentation keeping track of it is non-trivial. The only reasonable way around it is to SELECT the ref_id of the referrer just updated, so we lose some of the savings we anticipated.

Far more efficient would be for the code to keep an internal cache of referrers and a counter of duplicate entries, then periodically mass-update the sub-tables with the counters. By caching the foreign keys we avoid looking them up in the database, and by caching the duplicate counts we eliminate a large number of DB updates.

Basic Caching with Linked Lists

Initially we used a simple LinkedList class to cache these values. Our linked list is formed from nodes holding a referrer string, a hash value, and a counter integer.
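
A minimal sketch of such a node; we also keep the foreign-key id in it, since duplicates need to supply ref_id to the hits table without a DB round-trip (field names are illustrative):

// Sketch of a cache entry: the referrer string, its pre-computed hash, the
// foreign-key id from the referrers table, and a counter of duplicates seen
// since the last flush.
class RefNode {
    final String referrer;
    final int hash;      // cached hash, so repeated lookups avoid recomputing it
    final long refId;    // foreign key into the referrers table
    int count;           // duplicates seen since the last flush

    RefNode(String referrer, long refId) {
        this.referrer = referrer;
        this.hash = referrer.hashCode();
        this.refId = refId;
        this.count = 0;
    }
}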

The program flow in pseudocode for the referrers table:

For each new log entry
  extract referrer
  if (referrer is already in cache)
    increment cached counter
    retrieve foreign key id from cache
  else
    insert new referrer into DB
    select ref_id from DB
    insert new referrer into cache

Periodically we flush the cache to the database. For bulk-loading in an ETL scenario we can use large caches with infrequent flushing, but in a real-time scenario we prefer to flush every second to keep the UI up to date.

Flushing pseudocode:

For each entry in cache
 add the cached count to the count stored in the DB
 reset the cached counter to zero

Adding the cached counter to the existing value in the database is easy to implement in SQL:

UPDATE referrers SET count = count + cached_value

A suitable WHERE clause is added to match the ref_id. We reset the cached counters to zero rather than deleting the cached referrers, so that we do not have to rebuild the cache after each flush.
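
A sketch of this flush using JDBC batching, assuming the RefNode entries sketched earlier and an open Connection:

import java.sql.*;
import java.util.List;

// Sketch of the flush: add each cached counter to the value stored in the DB,
// then reset the counter to zero so the cache survives the flush intact.
class CacheFlusher {
    static void flush(Connection conn, List<RefNode> cache) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE referrers SET count = count + ? WHERE ref_id = ?")) {
            for (RefNode node : cache) {
                if (node.count == 0) {
                    continue;       // nothing new for this referrer since the last flush
                }
                ps.setInt(1, node.count);
                ps.setLong(2, node.refId);
                ps.addBatch();
                node.count = 0;     // reset, but keep the entry cached
            }
            ps.executeBatch();      // one round of UPDATEs per flush
        }
    }
}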

This is more efficient, and we can set the cache parameters, for example, to update the counters in the DB only after every 20,000 incoming rows (making full use of memory) or after a timer expires (to keep the UI as close to real time as possible).

Those 20,000 rows would have required 80,000 counter UPDATEs in the DB, but with the cache we cut that down to the number of rows in the sub-tables, each of which is updated only once per flush. The number of rows in the sub-tables is typically only a few percent of the number of rows in the main hits table.

Assuming an average normalized sub-table size of 10% of the main table, the four sub-tables need only 4 × 2,000 = 8,000 UPDATEs per flush instead of 80,000, eliminating 72,000 UPDATE statements for every 20,000 incoming rows. The actual value depends on the nature of the incoming data, but 10% is typical for these tables. What would have required 100,000 DB write operations now requires 28,000.

A substantial improvement

Our performance using this cache went from 250 lines entered per second to over 3,000. A huge performance boost, just with a basic linked-list cache. We made the size of the linked list and the flushing criteria configurable from a properties file.

This allows us to tune the application to local hardware for best performance. A larger cache leaves the front-end a little behind real time but speeds up additions to the database, whereas a smaller cache, down to no caching at all, can be configured where front-end freshness is of more concern.

Experimenting with different cache sizes and flushing rules allowed us to find the best balance. The need to do multiple test runs under identical conditions made extracting the caching parameters into a properties file essential.

We do not need to recompile to test different caching parameters, and this is also ideal for application administrators in production environments, as it allows fine-tuning to suit local requirements.
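
For illustration, loading the parameters might look like this; the file name, property keys and defaults are placeholders rather than our actual configuration:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

// Sketch of reading the caching parameters from a properties file so they can
// be tuned without recompiling. Keys and defaults shown are placeholders.
class CacheConfig {
    final int maxCacheEntries;
    final int flushEveryRows;
    final long flushIntervalMillis;

    CacheConfig(String path) throws IOException {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        maxCacheEntries = Integer.parseInt(props.getProperty("cache.maxEntries", "100000"));
        flushEveryRows = Integer.parseInt(props.getProperty("cache.flushEveryRows", "20000"));
        flushIntervalMillis = Long.parseLong(props.getProperty("cache.flushIntervalMillis", "1000"));
    }
}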

Refining the Cache

As the code matured we reached a stage of getting up to 6,000 entries per second, still far short of our target of 20k. Timing measurements of various parts of the code revealed that the time spent traversing the linked list was unacceptably high.

The unsophisticated use of a linked list for a cache was simply too slow. But with all the framework for caching in place, we could now refactor for something with far faster look-ups: a hashed array.

We replaced the linked list with a hash array. Ultimately the memory requirements will be the same. If we figure the cache will have 100,000 entries then we create an array of 100,000 elements at start-up; the equivalent LinkedList would start at zero size and grow over time to something approaching the array size.

As before we look up each referrer to see if it is already cached, and increment the cached counter if it is. We start with a simple array traversal, which we will later replace with a hash map for the fastest possible retrievals.

[Code listing: cache_lookup]
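
The listing itself isn’t reproduced here, but a sketch of what it does, using the RefNode entries from earlier and a pre-sized array treated as a list, looks like this (names are illustrative):

// Sketch of the array-based lookup: the array is allocated at start-up and,
// for now, traversed linearly just as the LinkedList was.
class ArrayRefCache {
    private final RefNode[] refs;
    private int size = 0;

    ArrayRefCache(int capacity) {
        refs = new RefNode[capacity];   // e.g. 100,000 elements allocated up front
    }

    // Returns the cached node for this referrer, or null if it is not cached yet.
    RefNode find(String referrer) {
        int hash = referrer.hashCode();
        for (int i = 0; i < size; i++) {
            RefNode node = refs[i];
            if (node.hash == hash && node.referrer.equals(referrer)) {
                node.count++;           // duplicate seen: bump the cached counter
                return node;
            }
        }
        return null;
    }

    void add(RefNode node) {
        refs[size++] = node;            // append; capacity checks omitted for brevity
    }
}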

We do not go directly to the hash map for reasons of testability. We are developing the wider application, including back-end, front-end, and middleware, concurrently, and each module is updated in increments designed to have minimal impact on upstream and downstream modules.

Each module requires extensive testing and by continually improving the code in small steps we are able to uncover a broader range of test scenarios than we would with a big-bang approach. First of all we need to remove the list methods and replace them with the array, to ensure that there is no impact on other modules. Later we will further refine the array to a map.

For now, we are treating the array as a list and following the same pattern as before. We search the array and if the entry we are searching for is not present it will be inserted into the DB, and added to the cache:

[Code listing: cache_insert]
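
Again as a sketch rather than the original listing: hitObj, setRef_Id and the cache’s add method are the names used in this article, while the getReferrer accessor and the insert helper are illustrative stand-ins.

import java.sql.*;

// Sketch of the cache_insert flow: check the cache, fall back to the DB only
// for genuinely new referrers, and store the key on the hit via setRef_Id.
class ReferrerCacheStep {
    private final ArrayRefCache refs;

    ReferrerCacheStep(ArrayRefCache refs) {
        this.refs = refs;
    }

    void process(Connection conn, hitObj hit) throws SQLException {
        RefNode node = refs.find(hit.getReferrer());     // mostly a cache hit
        if (node == null) {
            // New referrer: INSERT into the DB and cache its primary key.
            long refId = insertReferrerReturningId(conn, hit.getReferrer());
            node = new RefNode(hit.getReferrer(), refId);
            refs.add(node);
        }
        hit.setRef_Id(node.refId);                       // key needed for the hits INSERT
    }

    private static long insertReferrerReturningId(Connection conn, String referrer)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO referrers (referrer, count) VALUES (?, 1)",
                Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, referrer);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                keys.next();
                return keys.getLong(1);
            }
        }
    }
}

// Illustrative subset of the hit object; the real class has many more fields.
class hitObj {
    private String referrer;
    private long refId;
    String getReferrer() { return referrer; }
    void setRef_Id(long id) { this.refId = id; }
}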

This checks to see if the referrer is already in the cache, which will mostly be true. If the hit, encapsulated in the hitObj class, contains a new referrer it is inserted into the DB and its primary key stored via the setRef_Id method.

Once the key value has been retrieved the new entry is added to the refs cache with the add method. Every duplicate of this referrer will now be able to get the primary key id, required for maintaining the integrity of the hits table, from the in-memory cache instead of the DB. Thus, in addition to eliminating so many DB UPDATE calls, we also eliminate an equal number of SELECTs.

Note also the use of timers in the code. This is essential for measuring performance at the low level, where operations are repeated millions, if not billions, of times. A millisecond saved here will have a major effect on performance.

For more details on the timer class, see our real-time log analysis article.
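
We won’t reproduce that timer class here, but as a rough illustration of the idea, a minimal accumulating timer (not the class from that article) could look like:

// Illustrative only: accumulate nanosecond timings across millions of calls
// and report an average, so tiny per-call costs become visible.
class SectionTimer {
    private long totalNanos = 0;
    private long calls = 0;
    private long startedAt;

    void start() { startedAt = System.nanoTime(); }

    void stop() {
        totalNanos += System.nanoTime() - startedAt;
        calls++;
    }

    double averageMicros() {
        return calls == 0 ? 0.0 : (totalNanos / 1000.0) / calls;
    }
}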

With a basic array instead of a linked list we saw an immediate doubling of performance. Replacing the array with a hash map added another 30% gain, and these refinements together bring performance up to 18k entries per second. Pretty close to our target of 20k. The next 10% will be the hardest, but there are several more refinements we can make.

Further Refinements

This is an ongoing project and we have further things to try out. Using the hash array trades memory for performance. The linked list is better for memory use and, unlike a hash, does not require its size to be determined in advance.

List traversal times can be improved by keeping the most frequent entries at the top of the list. This can be implemented by moving each cache entry to the head of the list whenever it is accessed. The overhead of this move, a simple list-management operation, is outweighed by the improved lookup times, as the most frequent entries will congregate at the top of the list.
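
A sketch of the move-to-front idea on a java.util.LinkedList of the RefNode entries used earlier:

import java.util.LinkedList;
import java.util.ListIterator;

// Sketch of move-to-front: a cache hit unlinks the node from its current
// position and re-inserts it at the head, so hot referrers drift forwards.
class MoveToFrontCache {
    private final LinkedList<RefNode> cache = new LinkedList<>();

    RefNode find(String referrer) {
        int hash = referrer.hashCode();
        ListIterator<RefNode> it = cache.listIterator();
        while (it.hasNext()) {
            RefNode node = it.next();
            if (node.hash == hash && node.referrer.equals(referrer)) {
                node.count++;
                it.remove();            // unlink from the current position...
                cache.addFirst(node);   // ...and promote it to the head
                return node;
            }
        }
        return null;                    // caller inserts into the DB, then calls add()
    }

    void add(RefNode node) {
        cache.addFirst(node);           // new entries also start at the head
    }
}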

Even more efficient would be to create a hash map, each entry of which contains the head of a linked list. The referrer’s hash would determine which list to use, and each list would be much shorter than a single list covering the whole cache.

The hashing will have to be altered so that many referrers share a hash, deliberately creating collisions. Then we create a hash map around 1% of the size currently in use. Each entry in this smaller array does not contain a value; it contains the head of a linked list. Each of these lists will be around 1% of the size of the single list originally used, radically reducing list traversal times at the cost of a simple hash-map look-up.
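
A sketch of that structure, with the bucket count and names purely illustrative:

import java.util.LinkedList;

// Sketch of the bucketed-list cache: a small array (roughly 1% of the expected
// number of entries) whose slots each hold a short linked list, selected by a
// deliberately coarse hash of the referrer.
class BucketedRefCache {
    private final LinkedList<RefNode>[] buckets;

    @SuppressWarnings("unchecked")
    BucketedRefCache(int bucketCount) {
        buckets = new LinkedList[bucketCount];   // e.g. 1,000 buckets for ~100,000 entries
        for (int i = 0; i < bucketCount; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private LinkedList<RefNode> bucketFor(String referrer) {
        // Coarse hashing on purpose: many referrers share each bucket.
        return buckets[Math.floorMod(referrer.hashCode(), buckets.length)];
    }

    RefNode find(String referrer) {
        int hash = referrer.hashCode();
        for (RefNode node : bucketFor(referrer)) {   // each list is ~1% of the old single list
            if (node.hash == hash && node.referrer.equals(referrer)) {
                node.count++;
                return node;
            }
        }
        return null;
    }

    void add(RefNode node) {
        bucketFor(node.referrer).addFirst(node);     // combine with move-to-front if desired
    }
}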

With the addition of keeping the most frequently accessed values at the head of the lists, this will provide the highest cache performance. Once again the number of lists, constrained by the size of the hash map, and the length of the lists can be parameterized and moved into an external configuration file, allowing the same runtime fine-tuning as our current implementation.

It’s something of a trade-off with memory, with performance gains accruing from the smaller hashes. We’ve seen research indicating that this is the best-performing caching strategy, and it is in our plans to try it out. Gains will not be spectacular, but at this stage of code maturity another 5% gain would be more than we could hope for.

Dedicated vs Shared Caches

Another issue is multi-threading. Currently each of our threads maintains its own cache. This eliminates any possibility of deadlock, at the cost of more memory and more flushing. It also introduces some subtle issues where one thread could be looking up an entry that doesn’t exist in its cache just as another is writing it to the DB, which would result in the occasional counter being out by one.

Not a major issue, and it could be worked around with error handling on any failed inserts that occur.

A shared cache would eliminate redundancy and the need for such error handling, and be more efficient still. However, great care would have to be taken with the implementation of a shared cache to ensure it is thread-safe. ConcurrentHashMap offers this capability, but we have a solution in mind based on queuing theory (which is used extensively throughout our application, itself a Markov chain).

This is also more in line with our design principle of modularity. A modular cache can operate as a standalone component, with adaptors on the incoming queues to allow for a wide variety of deployment scenarios.

We anticipate all cache writes being placed on a blocking queue that only the cache can read. This would require all cache reads to be deferred while there are new elements waiting on the queue, in order to avoid miscounts.
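
As a sketch of the queuing approach (not the token-bucket design discussed next), all writes could go through a LinkedBlockingQueue drained by a single cache thread:

import java.util.concurrent.LinkedBlockingQueue;

// Sketch: producer threads never touch the cache directly; they enqueue new
// entries, and only the cache's own thread drains the queue and mutates the
// cache. Readers can check for pending writes before trusting a miss.
class QueuedCacheWriter implements Runnable {
    private final LinkedBlockingQueue<RefNode> pendingWrites = new LinkedBlockingQueue<>();
    private final BucketedRefCache cache;

    QueuedCacheWriter(BucketedRefCache cache) {
        this.cache = cache;
    }

    // Called from any producer thread.
    void submit(RefNode node) throws InterruptedException {
        pendingWrites.put(node);
    }

    // A reader seeing pending writes should wait, to avoid miscounts.
    boolean hasPendingWrites() {
        return !pendingWrites.isEmpty();
    }

    @Override
    public void run() {
        try {
            while (true) {
                RefNode node = pendingWrites.take();   // blocks until a write arrives
                cache.add(node);                       // only this thread mutates the cache
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();        // shut down when interrupted
        }
    }
}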

This suggests a hierarchical queuing strategy. A token bucket seems ideal, with INSERTs of new entries into the sub-tables getting a high-priority token and UPDATEs to existing entries a lower priority. All cache reads would need to be deferred while there were entries on the high-priority queue.

Such a cache would be slower at start-up, when there is a preponderance of new entries over duplicates, but as the cache fills performance will steadily rise to a peak. Precise measurements of waiting times would need to be taken, and specific test data constructed, to ensure accurate throughput measurements of the priority queue.

As we get closer to the theoretical maximum performance our hardware can offer, the more sophisticated our code becomes.