Welcome!

Java Authors: Liz McMillan, Hari Gottipati, Tad Anderson, Yakov Fain, Pat Romanski

Related Topics: Java

Java: Article

Lazy Programmers Can Make Cache Today

Lazy Programmers Can Make Cache Today

Let me begin with a philosophical rant. There is a motto from scientific computing that carries over to many areas of computer science:

The gains made by better algorithms almost always outstrip the gains from better hardware.

I've frequently seen where algorithm improvements pay by factors of ten to tens of thousands in CPU time. One change I made in a numerical algorithm improved CPU requirements by a factor of 50,000: from weeks on a super computer to minutes on a workstation.

Any business-savvy engineer knows that algorithm improvements come at a price: the engineer's time. Striking that balance makes software systems move forward rather than stagger to a halt in bloat and dysfunction. It also helps to use people who actually know what they are doing - knowing how to compile code doesn't make you a software engineer any more than knowing how to spell makes you a writer. End of rant.

On to (rant-related) business. On most Web sites, think of how many times a data source will be used to retrieve the same data and produce the same content over and over again. Most successful services deliver a highly redundant amount of information to their users. For example, the JDJ Web site will deliver this (same) content to perhaps a hundred thousand users. If the servers are overtaxed, customers will experience significant delays or malfunctions.

There are several useful solutions to this. Well-configured caching proxy servers come to mind, although server-side scripting makes this difficult. Buying more hardware will eventually fix the problem, which may be the correct business solution.

But what about asking programmers to be a little more lazy?

The source for the LazyFileOutputStream can be downloaded from http://bpp.sourceforge.net/download/bpp-0.8.5b/src/bpp/LazyFileOutputStream.java. It acts just like a regular FileOutputStream except that if created on a file that already exists, it "reads" the data from the file instead of writing it. The stream compares what is already in the file with what you are currently writing to it. If at any point it sees that there is a difference in the data you are writing this time compared to what is already there, the stream automatically switches to a write mode that writes over the remainder of the file with the changes.

The upshot is, if your program generates the same output twice, the output file is unmodified the second time (leaving the original modification date). First, by simply changing FileOutputStream to LazyFileOutputStream, any downstream processing can use timestamp information on the files to check if they need to do anything at all. If the timestamp hasn't changed, neither have the contents.

But wait, there's more! In addition to the standard close(), the LazyFileOutputStream also supports abort(). This method effectively states "I'm done now, leave the rest of the file alone." The remainder of the file will be the same, even without reproducing it. This means that if you determine at an early point in the processing of the file that it's going to turn out the same, you can simply abort() to leave it alone. It's similar to the idea of not changing the modification dates on files that are rewritten with the same data, but allows you to save CPU time for the current process step as well as downstream processing.

Certain engines produce part of the template before you can conveniently intervene to decide if you really need to regenerate it. By opening up the output as a Lazy file, you can just abort() early and have the old version with the the old modification time around for downstream processing.

Okay, rant concluded and point made: CPUs around the planet are spinning through the same data tens of thousands of times, producing the same content tens of thousands of times. Instead of buying great big servers to manage this, a smart caching policy based on lazy file writers and some modification time testing could save some sites that same wild-sounding factor of 50,000, without having to buy 50,000 new servers.

Anecdote #1
There is a certain technical advantage to this style of writing data as well: most storage devices are easier to read from than write to, adhering to the 80/20 rule - 80% of file access will be reads, 20% will be writes. The LazyFileOutputStream takes advantage of that for the many files that are simply rewritten with the same content.

Anecdote #2
There must be a few curled toes out there saying to themselves, "Why not LazyFileWriter?" There are good technical reasons for the OutputStream: the logic of the data written must be checked in its raw "byte" format for the idea to work correctly, and you can always wrap this in an OutputStreamWriter followed by a BufferedWriter, which is what I recommend.

Now I'm even done with the anecdotes. Have a nice day.

More Stories By Warren MacEvoy

Warren D. MacEvoy is Asst Professor of Comp Science in the department of Computer Science, Mathematics & Statistics at Mesa State College, Grand Junction, Colorado.

Comments (1) View Comments

Share your thoughts on this story.

Add your comment
You must be signed in to add a comment. Sign-in | Register

In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


Most Recent Comments
Infernoz 01/16/05 10:16:45 PM EST

It sounds weird, but may work, provided that the old file content is in memory, if not, you may loose any speed benefit to massive disk seek/read time delays.

I've found that just opening/close files can be very costly too e.g. for mux/demux file processing, so it can be helpful to use an LRU cache for any Buffered (min 64Kbyte) File Readers/Writers, to reduce all of the I/O costs, this idea could be extended to support random access read/write caching.