HashMap Cache

As promised in our article Java: Caching In On Performance earlier this month, on this page we share our class for a HashMap cache. Read on for the code.

This is the latest refinement, in which we have fully implemented the HashMap and removed list traversals wherever we could. In our last article we had a basic cache backed by a list. Our measurements showed that in a batch run processing 400,000 Apache logfile records the code spent 113 seconds in the cache, most of it looping through the list.

With the HashMap class, searching the cache becomes a direct-access operation, bringing total cache processing time down to 2.1 seconds, or just under 200,000 cache operations per second. This is better than we had expected.
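As a quick sanity check on that figure (our own arithmetic sketch, assuming one cache operation per logfile record):

```java
public class Throughput {
    // 400,000 records processed in 2.1 seconds of cache time
    public static long opsPerSecond(long operations, double seconds) {
        return Math.round(operations / seconds);
    }
}
```

400,000 / 2.1 gives roughly 190,476 operations per second, i.e. just under 200,000.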

Testable Code
In this latest version we still have some redundancy, which is temporary and required for backward compatibility until we upgrade another module. In this article we again use the ‘referrers’ sub-table as an example. See Java: Caching In On Performance for more details of the project.

The data we need to cache consists of just two integers, which gives our cache the additional benefit of a small memory footprint. We cache the primary key id to cut down on the number of database SELECTs, and for each referrer seen we keep a count of duplicates. The duplicate count is periodically added to the database, being cached to cut down on UPDATEs.

We also need a hash to use as the key, which we have calculated as part of the initial pre-processing of the referrer string.
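The article does not show the hashing step itself; one plausible approach (an assumption on our part, not necessarily the project's actual pre-processing) is a fixed-length hex digest of the referrer string:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RefHash {
    // Hypothetical helper: returns a fixed-length hex digest suitable
    // for use as a HashMap key. The real pre-processing may differ.
    public static String hashKey(String referrer) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(referrer.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }
}
```

Any stable, well-distributed digest works here; the HashMap only needs equal referrer strings to map to equal keys.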

We model the cached values with an object. The class is below, with the getters and setters removed for clarity.

public class RefsTable {
	private int ref_id = 0;
	private String refhash = "";
	private int count = 0;

	public RefsTable(int ref_id, String refhash, int c) {
		this.ref_id = ref_id;
		this.refhash = refhash;
		this.count = c;
	}

	public void incCount() {
		this.count++;
	}
}

When we insert a new referrers row into the database we initialize its counter at 1, so when we insert it into the cache we initialize the cached counter to zero. Each duplicate row then has the cached counter incremented.
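The arithmetic behind this convention can be sketched as follows (names are illustrative): the database counter already accounts for the first occurrence at INSERT time, so the cached counter only tracks duplicates seen since the last flush.

```java
public class CountDemo {
    // The DB counter is set to 1 on INSERT (the first occurrence),
    // so the cached counter starts at 0 and counts duplicates only.
    public static int totalAfterFlush(int duplicatesSeen) {
        int dbCount = 1;     // written when the row is INSERTed
        int cachedCount = 0; // cache entry initialized to zero
        for (int i = 0; i < duplicatesSeen; i++) {
            cachedCount++;   // one incCount() per duplicate hit
        }
        return dbCount + cachedCount; // flush: UPDATE adds the cached delta
    }
}
```

Starting the cached counter at 1 instead would double-count the original INSERT on every flush.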

Currently we construct the object by passing the initial count as a hard-coded 0:

 new RefsTable(hitObj.getRef_id(), hitObj.getReferrerHashStr(), 0);

We’ve coded-in the option of passing a value into the constructor, to allow for possible future enhancements.

These objects are cached with the following class, with the ‘incCount’ method being called whenever a cache hit is registered:

import java.util.HashMap;
import java.util.Map;

public class ReferrersCache {

	private Timers listTimer = new Timers();
	private long timeInList = 0L;
	private HashMap<String, RefsTable> refsCache = new HashMap<>();

	public boolean addReferrerToCache(logLine hitObj) {
		listTimer.start();
		boolean out = false;
		refsCache.put(hitObj.getReferrerHashStr(),
				new RefsTable(hitObj.getRef_id(),
						hitObj.getReferrerHashStr(), 0));
		listTimer.stop();
		timeInList += listTimer.elapsedMillis();
		return out;
	}

	public boolean clearReferrerCache() {
		boolean out = false;
		for (Map.Entry<String, RefsTable> entry : refsCache.entrySet()) {
			entry.getValue().setCount(0);
		}
		return out;
	}

	public boolean isReferrerCached(String refhash) {
		listTimer.start();
		boolean out = false;
		if (refsCache.containsKey(refhash)) {
			refsCache.get(refhash).incCount();
			out = true;
		}
		listTimer.stop();
		timeInList += listTimer.elapsedMillis();
		return out;
	}

	public int getReferrerKeyFromCache(String refhash) {
		listTimer.start();
		int pk = refsCache.get(refhash).getRef_id();
		listTimer.stop();
		timeInList += listTimer.elapsedMillis();
		return pk;
	}

	public long getTimeInList() {
		return timeInList;
	}

	public HashMap<String, RefsTable> getRefsCache() {
		return refsCache;
	}
}

We do not actually need to keep the hash in the cached object, as we already use it as the key. We also need to add some manual error-checking for memory management to the addReferrerToCache method, hence the currently unused boolean. These are things that will be addressed in the next iteration.

Production Code

We can test this cache as it stands, and refactor out the redundancies when we are happy with the performance prior to user-acceptance testing and production deployment.

We’ve reduced the size of cached objects from over 2 KB to under 100 bytes each, so out-of-memory exceptions are not the major concern at this stage of testing, where we focus on performance and precision.

Note again the use of timers: this is how we measured cache execution times to arrive at the 2.1 seconds of total processing time spent in the cache. The final getter method is for the database; when the cache is flushed, the database controller class takes a copy of the cache and loops over every value to update the counters in the DB.

We measure accuracy by comparing the sum of all counters in each sub-table with the number of rows in the main table; the two should always be equal. It is an elegant and simple test, aided by inspecting the counters on a few particular rows calculated in advance from the raw test data.
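Once the per-row counters have been read back from a sub-table, the invariant reduces to a single comparison. A minimal sketch (method and parameter names are ours):

```java
import java.util.List;

public class AccuracyCheck {
    // Every main-table row references exactly one sub-table row,
    // so the sub-table counters must sum to the main-table row count.
    public static boolean countersMatch(List<Integer> subTableCounters, long mainTableRows) {
        long sum = 0;
        for (int c : subTableCounters) {
            sum += c;
        }
        return sum == mainTableRows;
    }
}
```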

We have four such caches in total, the other three caching host names, urls, and user agent identifiers, all using the same design. They could be combined into a generic cache, but that would be more of a code-maintenance decision than a performance one.

We prefer keeping them as separate classes, as we do not anticipate making any further changes in this area. At least, not until we scale into the billions. We can currently insert 300 million normalised Apache log entries per day on a twin-core database host.
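For reference, a combined generic version could look something like the sketch below. This is our illustration of the idea rather than the project's code; all four caches share the same shape, mapping a hash string to a primary key plus a duplicate count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical generic form covering referrers, hosts, urls and user agents.
public class SubTableCache {
    private static final class Entry {
        final int primaryKey;
        int count; // duplicates seen since the last flush
        Entry(int primaryKey) { this.primaryKey = primaryKey; }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    // Returns true (and bumps the counter) on a hit, false on a miss.
    public boolean isCached(String hash) {
        Entry e = cache.get(hash);
        if (e == null) {
            return false;
        }
        e.count++;
        return true;
    }

    public void add(String hash, int primaryKey) {
        cache.put(hash, new Entry(primaryKey));
    }

    public int keyOf(String hash) { return cache.get(hash).primaryKey; }
    public int countOf(String hash) { return cache.get(hash).count; }
}
```

One instance per sub-table would replace the four concrete classes, at the cost of losing the per-table method names.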

Notes

HashMaps are notorious for two things: memory leaks, and corruption or lock-ups when shared across threads without synchronization. We avoid both by periodically clearing the cache, accepting the overhead of having to rebuild the mappings, and by continuing to give each processing thread its own dedicated cache. The memory footprint has been minimised, and further speed gains will be negligible.
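Giving each processing thread its own cache can be expressed with ThreadLocal; a sketch of the idea (the project may wire this up differently):

```java
import java.util.HashMap;

public class PerThreadCache {
    // Each thread lazily receives its own HashMap, so the maps are never
    // shared between threads and need no synchronization.
    private static final ThreadLocal<HashMap<String, Integer>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    public static HashMap<String, Integer> local() {
        return CACHE.get();
    }
}
```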

For the next performance improvement we shall be turning our attention to the producer classes and optimizing the string tokenizer and string-to-integer conversions on the incoming data with our own customized methods.