Project Computing







LogSummary program

It is useful to have an idea of the volumes of material pageVault would archive when run on your web server. Running the pageVault trial will give you some indication, as will running the LogSummary program.

The LogSummary program reads one or more web server log files and attempts to estimate what percentages of responses are "novel" and hence archivable by pageVault.

The program is very simple-minded: it inspects only the date, URL, HTTP response code and content-length fields. Because the actual bytes of the response are not available, it cannot really know whether a response is novel, so it guesses based on the URL and the content length as logged.

It ignores log lines it can't understand, and those with non-200 HTTP response codes. For the remaining lines it checks whether the URL has been seen before, and if so, whether the content length has changed.

If not seen before, it counts the response as "new". If seen before but with a different length, it counts the response as "updated". Otherwise, it counts the response as "same".
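The rule above can be sketched in a few lines of Java. This is an illustration, not the LogSummary source: the class name NoveltyClassifier and its method are hypothetical, and it assumes that "changed" means "different from the most recently logged length for that URL".

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the same/new/updated rule; not the LogSummary source.
class NoveltyClassifier {
    enum Category { NEW, UPDATED, SAME }

    // Most recently logged content length for each URL seen so far.
    private final Map<String, Long> lastLength = new HashMap<>();

    // Classifies one HTTP 200 response and remembers its length for next time.
    Category classify(String url, long contentLength) {
        Long previous = lastLength.put(url, contentLength);
        if (previous == null) {
            return Category.NEW;          // URL never seen before
        }
        return previous == contentLength  // Long is unboxed, so this compares values
                ? Category.SAME
                : Category.UPDATED;
    }

    public static void main(String[] args) {
        NoveltyClassifier classifier = new NoveltyClassifier();
        System.out.println(classifier.classify("/index.html", 2141)); // NEW
        System.out.println(classifier.classify("/index.html", 2141)); // SAME
        System.out.println(classifier.classify("/index.html", 2200)); // UPDATED
    }
}
```

Because the map is never cleared, a URL counts as "new" only the first time it appears in any of the logs processed.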

For each day in the log(s) processed, it generates a report showing counts, bytes and percentages of responses in each category (same, new, updated). Note that the reporting base is cumulative, so you'd expect to see fewer "new" responses as time goes on for most web sites.
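As a rough illustration of the daily figures, the sketch below tallies counts and bytes per category for one day and derives the percentages. The class DayTally and its layout are hypothetical, and it prints plain text rather than the HTML the real program produces. It assumes "cumulative" means that the record of URLs already seen carries over from one day to the next, while each day gets a fresh tally.

```java
import java.util.Locale;

// Hypothetical per-day tally; class and method names are illustrative only.
class DayTally {
    // Category indexes into the arrays below.
    static final int SAME = 0, NEW = 1, UPDATED = 2;
    static final String[] LABELS = { "same", "new", "updated" };

    final long[] counts = new long[3];
    final long[] bytes = new long[3];

    // Records one classified response and its logged content length.
    void add(int category, long contentLength) {
        counts[category]++;
        bytes[category] += contentLength;
    }

    // Percentage of the day's total held by one category (0 for an empty day).
    static double percent(long[] values, int category) {
        long total = values[SAME] + values[NEW] + values[UPDATED];
        return total == 0 ? 0.0 : 100.0 * values[category] / total;
    }

    // The day, then one line per category: count, bytes and both percentages.
    String report(String day) {
        StringBuilder sb = new StringBuilder(day).append(System.lineSeparator());
        for (int c = 0; c < LABELS.length; c++) {
            sb.append(String.format(Locale.ROOT,
                    "  %-8s %6d (%5.1f%%) %10d bytes (%5.1f%%)%n",
                    LABELS[c], counts[c], percent(counts, c),
                    bytes[c], percent(bytes, c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        DayTally day = new DayTally();
        day.add(NEW, 2141);
        day.add(SAME, 2141);
        day.add(SAME, 2141);
        day.add(UPDATED, 2200);
        System.out.print(day.report("17/Oct/2002"));
    }
}
```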

The parameters to this program are one or more input log files to read, each assumed to contain log entries in ascending date order.

The format of the log entries is expected to be something like this (Apache format):

 - - [17/Oct/2002:14:05:08 +1100] \
	"GET /cgi-bin/test?p=x HTTP/1.1" 200 2141
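One way to pull the four fields of interest out of such an entry is a regular expression. The sketch below is an assumption about how this could be done, not the parser LogSummary actually uses; LogLineParser and Entry are hypothetical names. It treats the entry as a single line (without the backslash continuation shown above), takes the date as the part of the timestamp before the first colon, and skips any line that does not match or whose response code is not 200.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for Apache-style entries; not the LogSummary source.
class LogLineParser {
    // The fields this sketch keeps from one log entry.
    static final class Entry {
        final String day;          // e.g. "17/Oct/2002"
        final String url;          // request URL including any query string
        final long contentLength;  // response size as logged

        Entry(String day, String url, long contentLength) {
            this.day = day;
            this.url = url;
            this.contentLength = contentLength;
        }
    }

    // Matches: [day:time zone] "METHOD URL PROTOCOL" status bytes
    private static final Pattern ENTRY = Pattern.compile(
            "\\[([^:\\]]+):[^\\]]*\\] \"\\S+ (\\S+)[^\"]*\" (\\d{3}) (\\d+)");

    // Returns the parsed entry, or null for unparseable or non-200 lines.
    static Entry parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find() || !"200".equals(m.group(3))) {
            return null;
        }
        return new Entry(m.group(1), m.group(2), Long.parseLong(m.group(4)));
    }

    public static void main(String[] args) {
        Entry e = parse("- - [17/Oct/2002:14:05:08 +1100] "
                + "\"GET /cgi-bin/test?p=x HTTP/1.1\" 200 2141");
        System.out.println(e.day + " " + e.url + " " + e.contentLength);
    }
}
```

Using find() rather than matches() means the pattern does not depend on whatever fields precede the timestamp, so other log formats with the same timestamp, request and status layout would also be accepted.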

The LogSummary program is written in Java and should compile and run with Java 1.2 or any later version. It is provided here in source and compiled form for you to download and use as you wish. The version provided here reads logs in the standard Apache format, but it should be easy to modify to read other log formats.

The program writes its report to standard output in HTML format. To view the report, redirect standard output to a file with an .html extension and open that file in any web browser.

We encourage you to send us output from your site logs, which we'll add to our gallery of results. We won't identify your site other than by a generic description that you provide (such as "Large commercial site with dynamic pages and a heavy search emphasis" or "Medium government site with a large amount of new content each day").

For an example of the output generated by LogSummary, view the output gallery.


Web server logs do not contain enough information to produce an accurate representation of pageVault archiving volumes. In particular:

  1. genuinely different content with the same URL and response length is not counted as "updated";
  2. identical content with different response lengths caused by differences between HTTP/1.0 and HTTP/1.1 responses is counted as "updated";
  3. "materially" identical content with different response lengths caused by immaterial variations (such as a date/time, hit counter or cookie-derived name) is counted as "updated";
  4. POSTed data is not available in the logs. Hence, different responses for the "same" URL may often be caused by the server processing different POSTed data. In this case an "updated" response is counted when it would more properly be classified as "new" (that is, the POSTed data should be treated as part of the URL, as GET query data is). This doesn't change the "same/not-same" split, but it can disconcertingly inflate the "updated" statistics.

Hence, the output of LogSummary must be taken with a very large grain of salt. It will typically overestimate the amount of updated content, sometimes quite dramatically. It represents a first step in estimating volumes: a ballpark figure, but no more.

Download LogSummary


Compile like this: javac

As a Java class: LogSummary.class (save, don't view!)

Run like this:
 java LogSummary access-log1 access-log2 access-log3 ... > summary.html

Project Computing Pty Ltd    ACN: 008 590 967