Frequently Asked Questions
- What problems does pageVault solve?
Knowing exactly what content your web site delivered to every request
pageVault archives every novel response: static html, dynamic html
(generated by Perl, CGI programs, servlets, scripts, server-side includes,
databases or any other method), images, documents, sounds, and more.
So, for example, dynamically generated responses to search queries are archived
equally with responses to requests for the home page. Every unique response is archived.
Providing a history of how your web site looked at a certain date and time
pageVault provides easy access to the archive based on URL or date/time. You
can see how your site looked at any date/time since you started running pageVault.
You can see how a page changed, and even replay the evolution of a page over time,
seeing changes down to the byte level exactly as they were sent from the web server
to your site's visitors.
Backup and recoverability of content
As a side-effect of archiving every distinct response sent by your webserver,
pageVault provides the ability to recover html, an image, a stylesheet,
external JavaScript, or anything else sent by your web server.
Although some content-management systems have the capability for keeping
previous versions of some of these components, pageVault keeps everything
ever sent, in one repository, easily accessed by URL and date.
Architecture
- How does pageVault work?
The filter component of pageVault sits inside the web-server's address space.
It inspects each HTTP request, and if the pageVault configuration specifies that
the request is of a type which should be considered for archiving, it then
inspects the response.
The byte-stream making up the response is characterised by constructing
a checksum. pageVault can be configured to ignore certain parts of particular
responses as being "non-material": these parts are excluded from the checksum
calculation (but will be included in the archive if the response is archived).
The HTTP response header is also excluded from the checksum calculation.
If the checksum of the response is different from the checksum calculated
for the previous request for the same resource, pageVault archives the
response.
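To illustrate the decision the filter makes, here is a minimal Java sketch; the checksum algorithm (MD5), the unbounded map, and all names are assumptions for illustration only, as the real filter runs as native code inside the web server and keeps just a bounded, per-process cache:

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class NoveltyCheckSketch {
        // Last checksum seen for each URL. The real filter's cache is bounded
        // and per-process, which is why it can produce false positives.
        private final Map<String, byte[]> lastChecksum = new HashMap<>();

        // Returns true if this response body differs from the previous one
        // seen for the same URL. The HTTP header and any configured
        // non-material regions are assumed to be excluded before this call.
        public boolean isNovel(String url, byte[] materialBody) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(materialBody);
            byte[] previous = lastChecksum.put(url, digest);
            return previous == null || !MessageDigest.isEqual(previous, digest);
        }
    }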
For more details, see
Web site archiving - an approach to recording every materially different response produced by a website, a refereed paper presented at AusWeb03.
- You can't be serious! How can you possibly
archive everything produced by a web site?
Yes, we are serious! The pageVault design works well because:
Only "unique" never-before-seen response bodies are archived.
Response headers change for almost every request - different timestamp,
cookies, etc. But most response bodies are not unique - they are
identical to response bodies sent before.
Some otherwise identical response bodies are made unique by the
insertion of "non-material" or "non-substantive" content, such as
a text hit counter, the server date/time, or inane personalisation such
as "Welcome to our web site, Mary". pageVault can be easily configured
to ignore such content for the purposes of deciding whether a particular
response is unique.
Entire responses can be ignored based on their content-type or URL (matching
starting or ending strings, or a regular expression). For example, you may
not be interested in archiving MP3 or ZIP files.
Archived material is compressed in transmission to the archiver and stored
in compressed format.
pageVault has been designed and coded from the ground up to consume as few
resources as possible, and we live in a world where disk storage
costs just a dollar or two per gigabyte.
- What are the components of pageVault?
pageVault consists of these components:
- Filter
Runs inside the web server's address space,
identifying potentially unique request/response pairs
and writing them to disk.
Because the filter does not have a global view of the
web server (many web servers are at least multi-processing and often run
across several machines at different locations), and
only maintains a local and recent history of what responses have been sent,
it will generate some "false
positives": request/response pairs which are not really unique. These false positives are identified in subsequent components.
- Distributor
Runs as a separate process, usually on the
same machine as the web server, reading the temporary disk files of
potentially unique responses generated by the filter.
The distributor is able to immediately identify most of the
false positives generated by the filter, removing them from
further processing. The request URL and checksum of the remainder
are sent to the archiver component; if the archiver deems the
response unique, the distributor compresses it and sends it to
the archiver (a sketch of this exchange follows this list).
- Archiver
Runs as a separate process, usually on a
separate machine from the web server/filter/distributor.
The archiver maintains a persistent database of archived
requests. When sent a request URL and checksum from the
distributor, the archiver uses this database to
determine if the request/response is unique. If it is
unique, it solicits the complete details from the distributor
and stores them in the archiver database.
The archiver also exposes a query interface used by the
query servlet to search and retrieve from the archive.
Because the archiver can process responses from
multiple distributors, this architecture lends itself
to the establishment of web notaries and "federated"
or "union" archives of web content.
- Query servlet
Runs as a servlet in a Java Servlet
framework, such as Tomcat or Jetty. Provides a search and
retrieval frontend to the archiver's database.
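For illustration, the distributor-to-archiver exchange mentioned above might look like the following Java sketch; the wire format, message names, host name and port are all assumptions, not the real protocol:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.util.zip.GZIPOutputStream;

    public class DistributorSketch {
        // Offer a potentially unique response to the archiver; the compressed
        // body is transmitted only if the archiver has never seen this
        // (URL, checksum) pair.
        public void offer(String url, String checksum, byte[] body) throws IOException {
            try (Socket s = new Socket("archiver.example.com", 9400)) { // assumed host/port
                PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                out.println("OFFER " + checksum + " " + url);  // cheap: no body yet
                if ("UNIQUE".equals(in.readLine())) {          // archiver checked its database
                    GZIPOutputStream gz = new GZIPOutputStream(s.getOutputStream());
                    gz.write(body);                            // send compressed body
                    gz.finish();
                }
            }
        }
    }

Sending only the URL and checksum first keeps the common case (a duplicate) cheap: the compressed body crosses the network only when the archiver has never stored that response.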
- Can pageVault tell me what web pages were seen by a specific person?
No, it cannot. pageVault can tell you exactly what content was delivered
but in general, no system can tell you with 100% confidence "who saw it".
There are many fundamental problems in determining who has seen
what:
- Web Proxies "mask" the identity of the original requestor.
- Web Proxies/caches deliver cached versions of content without reference to
the original server.
- Search engines (such as Google) and archives (such as
The Internet Archive)
deliver cached versions of content without reference to
the original server.
- User identification is problematic. Even when IP-addresses survive through
proxies, it is impossible to reliably bind IP-addresses to people for many
reasons including:
- dynamic IP assignment - DHCP
- use of Network Address Translation (NAT)
- use of ISPs (transitory and often internal NAT addresses)
- movement of IP blocks between organisations
- "anonymizer" services
However, some pages may require authentication by your web server. When the
authenticated userid is logged, the access logs produced by your server may be
matched against the timestamped content in the pageVault archive to allow a
reconstruction of possible content seen by the user (the access log
correlation feature is scheduled for Version 1.4, August 2003). However,
extreme care must be taken in assuming that the "then-live" content of the
web site was seen by the user due to the impact of web caches mentioned above.
Content delivered by SSL is not cacheable and hence correlation can be
performed with greater confidence.
- Even with authenticated user access, is the person behind the
screen the "rightful" user of the userid?
- What about alternative approaches?
The only alternative approach that we know of is Vignette's (formerly Tower's) webcapture.
We've attempted as unbiased a comparison as possible
here.
Requirements
- What web servers does pageVault work with?
pageVault version 1.1 works with Apache version 2.0.40 and later.
Version 1.2 (released 30 March 2003) also supports Microsoft's IIS.
- What versions of UNIX and Windows are supported?
pageVault has no direct operating system dependencies. All that is
required is an operating system which runs a supported
web server and the required version of Java.
- What version of Java is required?
Java JVM 1.4.0 or later.
- What extra hardware resources on the web server are required?
The overhead imposed by pageVault on a typical web server environment
is very small, and typically no extra hardware provisioning is required.
Unless the machine running your web server is very heavily loaded,
you should notice very little if any response time degradation with
pageVault, because the CPU and memory loads imposed by pageVault on
most requests are very small. Basically, the main CPU load is the
calculation of the checksum of the response, which is a very efficient
operation.
As little processing as possible takes place in the critical path
of generating the response to the user. The pageVault distributor,
which typically runs on the same machine as the web server, can run
at a lower dispatching priority than the web server and will generally
consume few resources and only then when many unique web responses
are being generated.
The typical web server spends most of its time sending responses
it has sent previously, and the overhead of pageVault in these
circumstances is very low.
See also the answer to the question: How can
I minimise the resources used by pageVault?
- What do I need to run the archiver?
We recommend that the archiver be run on a separate machine from the
web server for these reasons:
- It doesn't need to run on the same machine as the web server. Adding
the archiver to the web server just makes the web server more complicated,
and complicated is bad.
- Web servers are often "exposed" in a DMZ network. You probably want to
protect your archives from intruders who gain access to your web server.
- The archiver will be busiest when the web server is busiest. It makes
sense to not have these two correlated loads running on the same machine.
- A single archiver can store information from many web servers.
The pageVault archiver does not require a high-end system.
Any recent commodity PC (eg, 2GHz Pentium/Celeron) running Windows or Linux
(or any other operating system for which a standard Java 1.4 JVM is available)
with 128MB of memory will suffice.
Given that, the main resource used by the archiver is disk space.
All text and HTML responses are
compressed using the ZIP algorithm. The volume of "novel" responses
generated by your web server determines the space required.
Generalisations are very hard to make - a "typical" medium sized commercial
or government web site might change/update less than 5MB of content per day
and generate 50MB of novel dynamic responses (unique query/search
responses), which may require after compression 20 MB of storage per day:
about 140 MB per week and 7 GB per annum.
However, a very dynamic site with customised content for a large
number of users could exceed those figures by at least an order of magnitude.
Remember that pageVault allows you to define "non material" differences in
content, greatly reducing the volume of "materially novel" responses.
Regardless, pageVault is designed to handle very large transaction loads
and storage volumes. And the decreasing price of disk storage means that
even a completely mirrored annual volume of 100GB of archived data
requires only an expenditure of a few hundred dollars of disk storage.
To get an estimate of how much "novel" content your web server generates,
run the simple LogSummary program available here.
This program reads one or more of your web server's access logs and estimates
based on the logged date, url, and content-length, how much content is
"new", "updated" and "the same".
- What do I need to run the viewer?
The pageVault Viewer requires a Java servlet container which conforms
to version 2.3 of the Java
Servlet specification. The freely available
and widely used Apache Tomcat
servlet container is highly recommended.
The Viewer does not have to be hosted on the same machine as the pageVault
Archive as it communicates with the Archive using TCP/IP. However,
co-location may simplify installation, management and security (by
disallowing non-local access to the Archiver query port).
- Why does the viewer require a dedicated servlet container?
The pageVault Viewer allows you to browse requests delivered by the archived
web sites at "a point in time". To do this, the pageVault Viewer tries to alter every URL on
pages it retrieves from the Archiver to include a timestamp and a prefix which will
send the request back to the Viewer. This means that when you click on a hyperlink
in a web page shown by the viewer, the request is handled by the Viewer which
retrieves the relevant response from the archive.
But a problem arises for URLs which pageVault cannot alter, such as those
generated by javascript running within the delivered page. Attempting to
interpret the javascript and alter it appropriately is not a viable approach.
So, occasionally a request for a page will be made which does not include
the expected prefix and timestamp. The pageVault viewer attempts to guess
the timestamp by inspecting the referer header received with the request.
However, these requests will have arbitrary URLs; hence, the servlet environment
must be configured to deliver every request to the pageVault Viewer
servlet, and hence the only context which the servlet container can run
is pageVault.
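As an illustration of the rewriting step, here is a minimal Java sketch; the "/pageVault" prefix, the timestamp format and the regular expression are all assumptions for illustration:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HrefRewriterSketch {
        private static final Pattern HREF = Pattern.compile("href=\"(/[^\"]*)\"");

        // Turn href="/about.html" into href="/pageVault/20030330120000/about.html"
        // so the follow-up click routes back through the Viewer.
        public static String rewrite(String html, String timestamp) {
            Matcher m = HREF.matcher(html);
            return m.replaceAll("href=\"/pageVault/" + timestamp + "$1\"");
        }

        public static void main(String[] args) {
            System.out.println(rewrite("<a href=\"/about.html\">About</a>", "20030330120000"));
        }
    }

With the container mapping every request (url-pattern /*) to the Viewer servlet, even unprefixed URLs such as those built by JavaScript still arrive at the Viewer, which then falls back to the referer-based timestamp guess described above.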
Installation
Operation
- Does pageVault understand virtual servers?
Yes. You can configure separate content select/reject rules and
content exclusion rules on a per "virtual server" basis. Archived
content from different virtual servers can be processed by separate
distributors and stored by separate archivers. Alternatively,
content from different virtual and even physical servers on different
machines can be stored by the same archiver, creating a consolidated
archive for multiple web sites.
- Can pageVault support web server farms?
Yes. Content from separate distributors (usually on the separate physical
machines making up the server farm) can be sent to a single archiver.
The archiver indexes content based on the requested URL (and posted data), not
on the physical machine delivering the content.
Because the distributor and archiver communicate via a parsimonious
protocol transported by TCP/IP, the archiver and distributor are normally
on separate machines and can be located anywhere on the internet.
- Can pageVault handle content delivered by SSL?
Yes. pageVault operates on the decrypted request and the response before it
is encrypted within the web server.
- How can I minimise the resources used by pageVault?
Exclude content-types and URL's which are of no interest.
By default, pageVault will process every response. However,
by using the PageVaultAcceptContentType,
PageVaultRejectContentType, PageVaultAcceptURL,
PageVaultRejectURL, PageVaultAcceptURLAndQuery
and PageVaultRejectURLAndQuery directives you can
"short-circuit" pageVault processing by removing it from the
web server's processing loop at a very early stage, effectively
eliminating pageVault overhead for these requests.
Content-type accepts/rejects are the most efficient because comparing
the targeted content-type with the response's content-type takes only
a few machine instructions.
When using accept/rejects based on URLs and URLs and query strings,
follow these guidelines for efficiency:
- use the starting or ending options in
preference to the regexp option
- use the exactCase option in preference to the
anyCase option
- when using the regexp option, make the regular
expression as simple and fast to evaluate as possible
(refer to the guidelines on efficient
regular expression specification)
pageVault can also be enabled/disabled at the virtual server
level.
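As an illustration only, a configuration fragment using these directives might look like the following; the argument syntax shown is an assumption, so consult the reference manual for the real form:

    # Hypothetical fragment - argument syntax is illustrative, not authoritative
    PageVaultRejectContentType audio/mpeg          # never consider mp3 responses
    PageVaultRejectURL ending .zip exactCase       # skip zip downloads
    PageVaultAcceptURL starting /news/ exactCase   # always consider the news section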
Create rules for defining content which is "not material"
Some web pages may contain content which is "not material" in
determining whether the "information content" of the page has
changed. For example, a text "hit counter", a salutation
such as "Good morning Jim" or other dynamically generated content
which may slightly personalise the page often would not warrant
the archiving of the page as being a unique response.
pageVault allows you to define rules for a single URL or
matching a set of URL's (based on starting or ending strings
or a regular expression) which specify the start and end markers
of content to be excluded for the purpose of deciding whether the
response is novel and hence needs to be archived.
Refer to the reference manual for the details on
how to do this efficiently.
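Conceptually, exclusion between start and end markers works like this minimal Java sketch; the marker strings are hypothetical, since the real markers are defined per rule in the configuration:

    public class NonMaterialSketch {
        // Remove everything between hypothetical start/end markers before the
        // checksum is calculated; the archived copy keeps the full response.
        public static String stripNonMaterial(String body) {
            return body.replaceAll("(?s)<!--pv-ignore-->.*?<!--/pv-ignore-->", "");
        }
    }

Note that the stripped text affects only the checksum: if the response is deemed novel, the archive stores the complete response as delivered.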
- Tuning the PageVaultBufferSize parameter
PageVaultBufferSize specifies the amount of memory pageVault
uses per request to buffer the response. The idea is to fit "most"
responses entirely in memory, avoiding a write to disk before the
response is complete and its checksum, and hence its uniqueness,
can be determined. But a large value of this parameter increases
the web server's memory usage.
Refer to the reference manual for a discussion of the tradeoffs in
setting this parameter.
- Tuning the PageVaultHashTableSize parameter
PageVaultHashTableSize specifies the size of the
per-process cache which tracks responses and their checksums.
A large cache more efficiently identifies and hence winnows out
duplicate responses before they are written to disk and
processed by the pageVault distributor, but also consumes
more memory.
Refer to the reference manual for a discussion of the tradeoffs in
setting this parameter.
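For illustration, both tuning parameters might appear together in the web server configuration like this; the values are placeholders, not recommendations:

    # Illustrative values only - consult the reference manual for guidance
    PageVaultBufferSize 65536        # bytes buffered in memory per response
    PageVaultHashTableSize 8192      # entries in the per-process checksum cache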
- Does pageVault only archive HTML?
No - pageVault will archive any content generated by a web server:
html, text, images, Microsoft Office documents, PDF, sound... All content is
treated as a bytestream by pageVault. However, pageVault can be easily configured
to not archive responses, based on either content type or flexible URL pattern matching.
- Can I perform free-text searches on the archive?
Yes, pageVault now supports free text searches on the archive contents.
- When more than one word is entered, words are ANDed together and the
words may appear in any order unless...
- ... you specify a phrase search by enclosing search words in double quotes, or...
- ... you precede a word with a hyphen, which acts as a "NOT" operator.
- OR searches are not supported.
- All searches are case insensitive.
- The entire contents of a document (including markup) are searched.
- This is really a "string" rather than a word search, in that word boundaries are not
recognised. That is, a search for red will be satisfied by any of these
strings: red, reddish, shredded.
Examples:
- Roman Catholic will match documents containing both of the strings
roman and catholic
- Catholic -roman will match documents containing the string
catholic but which do not contain the string roman
- "10 green bottles" -wall -fall documents containing the string
10 green bottles (as 16 continguous characters) but which do not contain the strings wall or fall
pageVault does not build an inverted list of words in the archive, so that
whilst free-text searching does not increase disk space requirements,
very broad free text searches may take a "significant" amount of time:
- free text searching is very much IO constrained, so the speed you'll
observe is largely dependent on the IO rather than the CPU capabilities
of your archiver hardware. As a guide, the archiver running on
a "commodity" 2GHz Celeron system with 7200 rpm IDE disk under Windows XP
can process about 300 archived files per second (decompress, scan for
text) whilst keeping the CPU about 30% busy.
- hence, unless your archive consists of just a few thousand html files,
it is probably a good idea to restrict the search to a part of your
archive. Eg, if the text you are looking for is in the "mediaRelease"
subdirectory, provide a URL such as www.mysite.com/mediaRelease/*.
- Can pageVault archive FTP responses?
No, pageVault only operates on HTTP requests and responses.
- How should the archiver repository be backed up?
It depends on your organisation's preferences and practices. Options include:
- Mirror the disk used to store the repository
- Use RAIDed disk for the repository
- Periodically halt the archiver, use standard full or incremental disk backup software
(commercial, or an open source synchronisation tool such as rsync)
and restart the archiver. Note that data to be archived will be
held in a disk-based queue by the distributor and will be stored when the archiver
is restarted.
Depending on demand, a future version of pageVault may contain a
replication facility for automatically maintaining a mirror of the repository on
another system.
- I don't want to have to manage the archive - can you
do it for me?
Yes. Running the archive is not particularly onerous, but you may want
to "outsource" this task to an organisation providing what is effectively
a "notary" service for your web communications.
Please contact us if you'd like to discuss this.
Support, Trial, Licensing
- Who is Project Computing?
Project Computing is a small software house which has been successfully
producing custom and packaged software for
over 20 years.
- Where's the pageVault email group?
The pageVault Discussion
Group is managed and archived by Yahoo! Groups.
- Can I trial pageVault?
Yes, a 30 day trial version is available by completing
the trial application form here.
This version is restricted to archiving web pages "harvested" using the
GNU wget utility, but provides a complete demonstration of pageVault
operating on your live web site data.
- What database software does the archiver use and how much maintenance does it require?
The pageVault archiver uses the open-source JDBM B+Tree index, with performance
and operational enhancements implemented by Project Computing and contributed back to
the JDBM project. B+Tree indices are very fast and extremely scalable, and JDBM
is an excellent B+Tree implementation which requires no ongoing maintenance.
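For the curious, storing and retrieving an index entry with a JDBM B+Tree looks roughly like this sketch, written from memory of the JDBM 1.0 API; treat the class names and signatures as assumptions:

    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;
    import jdbm.btree.BTree;
    import jdbm.helper.StringComparator;

    public class BTreeSketch {
        public static void main(String[] args) throws Exception {
            // Open (or create) a record store backing the B+Tree on disk
            RecordManager recman = RecordManagerFactory.createRecordManager("archiveIndex");
            BTree index = BTree.createInstance(recman, new StringComparator());
            // Store a checksum keyed by URL, replacing any previous value
            index.insert("/index.html", "d41d8cd98f00b204e9800998ecf8427e", true);
            System.out.println(index.find("/index.html"));
            recman.commit();
            recman.close();
        }
    }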
- How do I license pageVault? What does it cost?
You can order a license for pageVault by completing this online
order form or by contacting Project Computing.
One license is required on each physical web server machine running pageVault,
regardless of the number of web servers, web sites serviced or CPUs.
For further information and pricing, see the pageVault
license page.
- What's coming up?
- Access Log correlation - Version 1.4, Feb 2004
All registered users of Version 1.1, 1.2 and 1.3 qualify for a free upgrade to this version.
Further Reading
- Web-archiving: Managing and Archiving Online Documents and Records - Monday 25th March, 2002
- 2nd ECDL Workshop on Web Archiving
- 2nd ECDL Workshop on Web Archiving - report by Michael Day
- Towards Continuous Web Archiving - First Results and an Agenda for the Future - Julien Masanès
- Web Archiving From the Ground Up
- 'Why Do We Need to Keep This in Print? It's on the Web ...': a Review of Electronic Archiving Issues and Problems - Dorothy Warner
- Archiving Web Resources - National Archives of Australia
- Guidelines for Electronic Records Management on State and Federal Agency Websites - Charles R. McClure, J. Timothy Sprehe
- Managing Websites Seminar: Gearing up for the e-commerce era - Australian Society of Archivists Electronic Records Special Interest Group, 1999
- CIO Hotline: The CIO's guide to effective records management - Debra Logan and Mark Gilbert, Gartner
- PageTurner: A large-scale study of the evolution of Web pages - Microsoft Research
- The Fading Memory of the State - David Talbot, Technology Review, July 2005
- Web site archiving - an approach to recording every materially different response produced by a website - Kent Fitch, paper presented at AusWeb03
- Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices - Dollar Consulting, July 20, 2001, for the Smithsonian Institution
- Bibliographies
- Archiving Web Resources - an international conference at the National Library of Australia, 9-11 November 2004