Frequently Asked Questions
- What problems does pageVault solve?
Knowing exactly what content your web site delivered to every request
pageVault archives every novel response: static html, dynamic html
(generated by Perl, CGI programs, servlets, scripts, server-side includes,
databases or any other method), images, documents, sounds, and more.
So, for example, dynamically generated responses to search queries are archived
equally with responses to requests for the home page. Every unique response is archived.
Providing a history of how your web site looked at a certain date and time
pageVault provides easy access to the archive based on URL or date/time. You
can see how your site looked at any date/time since you started running pageVault.
You can see how a page changed, and even replay the evolution of a page over time,
seeing changes down to the byte level exactly as they were sent from the web server
to your site's visitors.
Backup and recoverability of content
As a side-effect of archiving every distinct response sent by your webserver,
pageVault provides the ability to recover html, an image, a stylesheet,
external JavaScript, or anything else sent by your web server.
Although some content-management systems have the capability for keeping
previous versions of some of these components, pageVault keeps everything
ever sent, in one repository, easily accessed by URL and date.
Architecture
- How does pageVault work?
The filter component of pageVault sits inside the web-server's address space.
It inspects each HTTP request, and if the pageVault configuration specifies that
the request is of a type which should be considered for archiving, it then
inspects the response.
The byte-stream making up the response is characterised by constructing
a checksum. pageVault can be configured to ignore certain parts of particular
responses as being "non-material": these parts are excluded from the checksum
calculation (but will be included in the archive if the response is archived).
The HTTP response header is also excluded from the checksum calculation.
If the checksum of the response is different from the checksum calculated
for the previous request for the same resource, pageVault archives the
response.
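To illustrate the decision the filter makes, here is a minimal Java sketch; the checksum algorithm (MD5), the unbounded map, and all names are assumptions for illustration only, as the real filter runs as native code inside the web server and keeps just a bounded, per-process cache:

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class NoveltyCheckSketch {
        // Last checksum seen for each URL. The real filter's cache is bounded
        // and per-process, which is why it can produce false positives.
        private final Map<String, byte[]> lastChecksum = new HashMap<>();

        // Returns true if this response body differs from the previous one
        // seen for the same URL. The HTTP header and any configured
        // non-material regions are assumed to be excluded before this call.
        public boolean isNovel(String url, byte[] materialBody) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(materialBody);
            byte[] previous = lastChecksum.put(url, digest);
            return previous == null || !MessageDigest.isEqual(previous, digest);
        }
    }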
For more details, see
Web site archiving - an approach to recording every materially different response produced by a website, a refereed paper presented at AusWeb03.
- You can't be serious! How can you possibly
archive everything produced by a web site?
Yes, we are serious! The pageVault design works well because:
Only "unique" never-before-seen response bodies are archived.
Response headers change for almost every request - different timestamp,
cookies, etc. But most response bodies are not unique - they are
identical to response bodies sent before.
Some otherwise identical response bodies are made unique by the
insertion of "non-material" or "non-substantive" content, such as
a text hit counter, the server date/time, or inane personalisation such
as "Welcome to our web site, Mary". pageVault can be easily configured
to ignore such content for the purposes of deciding whether a particular
response is unique.
Entire responses can be ignored based on their content-type or URL (matching
starting or ending strings, or a regular expression). For example, you may
not be interested in archiving MP3 or ZIP files.
Archived material is compressed in transmission to the archiver and stored
in compressed format.
pageVault has been designed and coded from the ground up to consume as few
resources as possible, and we live in a world where disk storage
costs just a dollar or two per gigabyte.
- What are the components of pageVault?
pageVault consists of these components:
- Filter
Runs inside the web server's address space,
identifying potentially unique request/response pairs
and writing them to disk.
Because the filter does not have a global view of the
web server (many web servers are at least multi-processing and often run
across several machines at different locations), and
only maintains a local and recent history of what responses have been sent,
it will generate some "false
positives": request/response pairs which are not really unique. These false positives are identified in subsequent components.
- Distributor
Runs as a separate process, usually on the
same machine as the web server, reading the temporary disk files of
potentially unique responses generated by the filter.
The distributor is able to immediately identify most of the
false positives generated by the filter, removing them from
further processing. The request URL and checksum of the remainder
are sent to the archiver component; if the archiver deems the
response unique, the distributor compresses it and sends it to
the archiver (a sketch of this exchange follows this list).
- Archiver
Runs as a separate process, usually on a
separate machine from the web server/filter/distributor.
The archiver maintains a persistent database of archived
requests. When sent a request URL and checksum from the
distributor, the archiver uses this database to
determine if the request/response is unique. If it is
unique, it solicits the complete details from the distributor
and stores them in the archiver database.
The archiver also exposes a query interface used by the
query servlet to search and retrieve from the archive.
Because the archiver can process responses from
multiple distributors, this architecture lends itself
to the establishment of web notaries and "federated"
or "union" archives of web content.
- Query servlet
Runs as a servlet in a Java Servlet
framework, such as Tomcat or Jetty. Provides a search and
retrieval frontend to the archiver's database.
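For illustration, the distributor-to-archiver exchange mentioned above might look like the following Java sketch; the wire format, message names, host name and port are all assumptions, not the real protocol:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.util.zip.GZIPOutputStream;

    public class DistributorSketch {
        // Offer a potentially unique response to the archiver; the compressed
        // body is transmitted only if the archiver has never seen this
        // (URL, checksum) pair.
        public void offer(String url, String checksum, byte[] body) throws IOException {
            try (Socket s = new Socket("archiver.example.com", 9400)) { // assumed host/port
                PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                out.println("OFFER " + checksum + " " + url);  // cheap: no body yet
                if ("UNIQUE".equals(in.readLine())) {          // archiver checked its database
                    GZIPOutputStream gz = new GZIPOutputStream(s.getOutputStream());
                    gz.write(body);                            // send compressed body
                    gz.finish();
                }
            }
        }
    }

Sending only the URL and checksum first keeps the common case (a duplicate) cheap: the compressed body crosses the network only when the archiver has never stored that response.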
- Can pageVault tell me what web pages were seen by a specific person?
No, it cannot. pageVault can tell you exactly what content was delivered
but in general, no system can tell you with 100% confidence "who saw it".
There are many fundamental problems in determining who has seen
what:
- Web Proxies "mask" the identity of the original requestor.
- Web Proxies/caches deliver cached versions of content without reference to
the original server.
- Search engines (such as Google) and archives (such as
The Internet Archive)
deliver cached versions of content without reference to
the original server.
- User identification is problematic. Even when IP-addresses survive through
proxies, it is impossible to reliably bind IP-addresses to people for many
reasons including:
- dynamic IP assignment - DHCP
- use of Network Address Translation (NAT)
- use of ISPs (transitory and often internal NAT addresses)
- movement of IP blocks between organisations
- "anonymizer" services
However, some pages may require authentication by your web server. When the
authenticated userid is logged, the access logs produced by your server may be
matched against the timestamped content in the pageVault archive to allow a
reconstruction of possible content seen by the user (the access log
correlation feature is scheduled for Version 1.4, August 2003). However,
extreme care must be taken in assuming that the "then-live" content of the
web site was seen by the user due to the impact of web caches mentioned above.
Content delivered by SSL is not cacheable and hence correlation can be
performed with greater confidence.
- Even with authenticated user access, is the person behind the
screen the "rightful" user of the userid?
- What about alternative approaches?
The only alternative approach that we know of is Vignette's (formerly Tower's) webcapture.
We've attempted as unbiased a comparison as possible
here.
Requirements
- What web servers does pageVault work with?
pageVault version 1.1 works with Apache version 2.0.40 and later.
Version 1.2 (released 30 March 2003) also supports Microsoft's IIS.
- What versions of UNIX and Windows are supported?
pageVault has no direct operating system dependencies. All that is
required is an operating system which runs a supported
web server and the required version of Java.
- What version of Java is required?
Java JVM 1.4.0 or later.
- What extra hardware resources on the web server are required?
The overhead imposed by pageVault on a typical web server environment
is very small, and typically no extra hardware provisioning is required.
Unless the machine running your web server is very heavily loaded,
you should notice very little if any response time degradation with
pageVault, because the CPU and memory loads imposed by pageVault on
most requests are very small. Basically, the main CPU load is the
calculation of the checksum of the response, which is a very efficient
operation.
As little processing as possible takes place in the critical path
of generating the response to the user. The pageVault distributor,
which typically runs on the same machine as the web server, can run
at a lower dispatching priority than the web server and will generally
consume few resources and only then when many unique web responses
are being generated.
The typical web server spends most of its time sending responses
it has sent previously, and the overhead of pageVault in these
circumstances is very low.
See also the answer to the question: How can
I minimise the resources used by pageVault?
- What do I need to run the archiver?
We recommend that the archiver be run on a separate machine from the
web server for these reasons:
- It doesn't need to run on the same machine as the web server. Adding
the archiver to the web server just makes the web server more complicated,
and complicated is bad.
- Web servers are often "exposed" in a DMZ network. You probably want to
protect your archives from intruders who gain access to your web server.
- The archiver will be busiest when the web server is busiest. It makes
sense to not have these two correlated loads running on the same machine.
- A single archiver can store information from many web servers.
The pageVault archiver does not require a high-end system.
Any recent commodity PC (eg, 2GHz Pentium/Celeron) running Windows or Linux
(or any other operating system for which a standard Java 1.4 JVM is available)
with 128MB of memory will suffice.
Given that, the main resource used by the archiver is disk space.
All text and HTML responses are
compressed using the ZIP algorithm. The volume of "novel" responses
generated by your web server determines the space required.
Generalisations are very hard to make - a "typical" medium sized commercial
or government web site might change/update less than 5MB of content per day
and generate 50MB of novel dynamic responses (unique query/search
responses), which may require after compression 20 MB of storage per day:
about 140 MB per week and 7 GB per annum.
However, a very dynamic site with customised content for a large
number of users could exceed those figures by at least an order of magnitude.
Remember that pageVault allows you to define "non material" differences in
content, greatly reducing the volume of "materially novel" responses.
Regardless, pageVault is designed to handle very large transaction loads
and storage volumes. And the decreasing price of disk storage means that
even a completely mirrored annual volume of 100GB of archived data
requires only an expenditure of a few hundred dollars of disk storage.
To get an estimate of how much "novel" content your web server generates,
run the simple LogSummary program available here.
This program reads one or more of your web server's access logs and estimates
based on the logged date, url, and content-length, how much content is
"new", "updated" and "the same".
- What do I need to run the viewer?
The pageVault Viewer requires a Java servlet container which conforms
to version 2.3 of the Java
Servlet specification. The freely available
and widely used Apache Tomcat
servlet container is highly recommended.
The Viewer does not have to be hosted on the same machine as the pageVault
Archive as it communicates with the Archive using TCP/IP. However,
co-location may simplify installation, management and security (by
disallowing non-local access to the Archiver query port).
- Why does the viewer require a dedicated servlet container?
The pageVault Viewer allows you to browse requests delivered by the archived
web sites at "a point in time". To do this, the pageVault Viewer tries to alter every URL on
pages it retrieves from the Archiver to include a timestamp and a prefix which will
send the request back to the Viewer. This means that when you click on a hyperlink
in a web page shown by the viewer, the request is handled by the Viewer which
retrieves the relevant response from the archive.
But a problem arises for URLs which pageVault cannot alter, such as those
generated by javascript running within the delivered page. Attempting to
interpret the javascript and alter it appropriately is not a viable approach.
So, occasionally a request for a page will be made which does not include
the expected prefix and timestamp. The pageVault viewer attempts to guess
the timestamp by inspecting the referer header received with the request.
However, these requests will have arbitrary URLs; hence, the servlet environment
must be configured to deliver every request to the pageVault Viewer
servlet, and hence the only context which the servlet container can run
is pageVault.
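As an illustration of the rewriting step, here is a minimal Java sketch; the "/pageVault" prefix, the timestamp format and the regular expression are all assumptions for illustration:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class HrefRewriterSketch {
        private static final Pattern HREF = Pattern.compile("href=\"(/[^\"]*)\"");

        // Turn href="/about.html" into href="/pageVault/20030330120000/about.html"
        // so the follow-up click routes back through the Viewer.
        public static String rewrite(String html, String timestamp) {
            Matcher m = HREF.matcher(html);
            return m.replaceAll("href=\"/pageVault/" + timestamp + "$1\"");
        }

        public static void main(String[] args) {
            System.out.println(rewrite("<a href=\"/about.html\">About</a>", "20030330120000"));
        }
    }

With the container mapping every request (url-pattern /*) to the Viewer servlet, even unprefixed URLs such as those built by JavaScript still arrive at the Viewer, which then falls back to the referer-based timestamp guess described above.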
Installation
Operation
- Does pageVault understand virtual servers?
Yes. You can configure separate content select/reject rules and
content exclusion rules on a per "virtual server" basis. Archived
content from different virtual servers can be processed by separate
distributors and stored by separate archivers. Alternatively,
content from different virtual and even physical servers on different
machines can be stored by the same archiver, creating a consolidated
archive for multiple web sites.
- Can pageVault support web server farms?
Yes. Content from separate distributors (usually on the separate physical
machines making up the server farm) can be sent to a single archiver.
The archiver indexes content based on the requested URL (and posted data), not
on the physical machine delivering the content.
Because the distributor and archiver communicate via a parsimonious
protocol transported by TCP/IP, the archiver and distributor are normally
on separate machines and can be located anywhere on the internet.
- Can pageVault handle content delivered by SSL?
Yes. pageVault operates on the decrypted request and the response before it
is encrypted within the web server.
- How can I minimise the resources used by pageVault?
Exclude content-types and URL's which are of no interest.
By default, pageVault will process every response. However,
by using the PageVaultAcceptContentType,
PageVaultRejectContentType, PageVaultAcceptURL,
PageVaultRejectURL, PageVaultAcceptURLAndQuery
and PageVaultRejectURLAndQuery directives you can
"short-circuit" pageVault processing by removing it from the
web server's processing loop at a very early stage, effectively
eliminating pageVault overhead for these requests.
Content-type accepts/rejects are the most efficient because comparing
the targeted content-type with the response's content-type takes only
a few machine instructions.
When using accept/rejects based on URLs and URLs and query strings,
follow these guidelines for efficiency:
- use the starting or ending options in
preference to the regexp option
- use the exactCase option in preference to the
anyCase option
- when using the regexp option, make the regular
expression as simple and fast to evaluate as possible
(refer to the guidelines on efficient
regular expression specification)
pageVault can also be enabled/disabled at the virtual server
level.
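As an illustration only, a configuration fragment using these directives might look like the following; the argument syntax shown is an assumption, so consult the reference manual for the real form:

    # Hypothetical fragment - argument syntax is illustrative, not authoritative
    PageVaultRejectContentType audio/mpeg          # never consider mp3 responses
    PageVaultRejectURL ending .zip exactCase       # skip zip downloads
    PageVaultAcceptURL starting /news/ exactCase   # always consider the news section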
Create rules for defining content which is "not material"
Some web pages may contain content which is "not material" in
determining whether the "information content" of the page has
changed. For example, a text "hit counter", a salutation
such as "Good morning Jim" or other dynamically generated content
which may slightly personalise the page often would not warrant
the archiving of the page as being a unique response.
pageVault allows you to define rules for a single URL or
matching a set of URL's (based on starting or ending strings
or a regular expression) which specify the start and end markers
of content to be excluded for the purpose of deciding whether the
response is novel and hence needs to be archived.
Refer to the reference manual for the details on
how to do this efficiently.
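Conceptually, exclusion between start and end markers works like this minimal Java sketch; the marker strings are hypothetical, since the real markers are defined per rule in the configuration:

    public class NonMaterialSketch {
        // Remove everything between hypothetical start/end markers before the
        // checksum is calculated; the archived copy keeps the full response.
        public static String stripNonMaterial(String body) {
            return body.replaceAll("(?s)<!--pv-ignore-->.*?<!--/pv-ignore-->", "");
        }
    }

Note that the stripped text affects only the checksum: if the response is deemed novel, the archive stores the complete response as delivered.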
- Tuning the PageVaultBufferSize parameter
PageVaultBufferSize specifies the amount of memory pageVault
uses per request to buffer the response. The idea is to fit "most"
responses entirely in memory, avoiding a write to disk before the
response is complete and its checksum, and hence its uniqueness,
can be determined. But a large value of this parameter increases
the web server's memory usage.
Refer to the reference manual for a discussion of the tradeoffs in
setting this parameter.
- Tuning the PageVaultHashTableSize parameter
PageVaultHashTableSize specifies the size of the
per-process cache which tracks responses and their checksums.
A large cache more efficiently identifies and hence winnows out
duplicate responses before they are written to disk and
processed by the pageVault distributor, but also consumes
more memory.
Refer to the reference manual for a discussion of the tradeoffs in
setting this parameter.
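For illustration, both tuning parameters might appear together in the web server configuration like this; the values are placeholders, not recommendations:

    # Illustrative values only - consult the reference manual for guidance
    PageVaultBufferSize 65536        # bytes buffered in memory per response
    PageVaultHashTableSize 8192      # entries in the per-process checksum cache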
- Does pageVault only archive HTML?
No - pageVault will archive any content generated by a web server:
html, text, images, Microsoft Office documents, PDF, sound... All content is
treated as a bytestream by pageVault. However, pageVault can be easily configured
to not archive responses, based on either content type or flexible URL pattern matching.
- Can I perform free-text searches on the archive?
Yes, pageVault now supports free text searches on the archive contents.
- When more than one word is entered, words are ANDed together and the
words may appear in any order unless...
- ... you specify a phrase search by enclosing search words in double quotes, or...
- ... you precede a word with a hyphen, which acts as a "NOT" operator.
- OR searches are not supported.
- All searches are case insensitive.
- The entire contents of a document (including markup) are searched.
- This is really a "string" rather than a word search, in that word boundaries are not
recognised. That is, a search for red will be satisfied by any of these
strings: red, reddish, shredded.
Examples:
- Roman Catholic will match documents containing both of the strings
roman and catholic
- Catholic -roman will match documents containing the string
catholic but which do not contain the string roman
- "10 green bottles" -wall -fall documents containing the string
10 green bottles (as 16 continguous characters) but which do not contain the strings wall or fall
pageVault does not build an inverted list of words in the archive, so that
whilst free-text searching does not increase disk space requirements,
very broad free text searches may take a "significant" amount of time:
- free text searching is very much IO constrained, so the speed you'll
observe is largely dependent on the IO rather than the CPU capabilities
of your archiver hardware. As a guide, the archiver running on
a "commodity" 2GHz Celeron system with 7200 rpm IDE disk under Windows XP
can process about 300 archived files per second (decompress, scan for
text) whilst keeping the CPU about 30% busy.
- hence, unless your archive consists of just a few thousand html files,
it is probably a good idea to restrict the search to a part of your
archive. Eg, if the text you are looking for is in the "mediaRelease"
subdirectory, provide a URL such as www.mysite.com/mediaRelease/*.
- Can pageVault archive FTP responses?
No, pageVault only operates on HTTP requests and responses.
- How should the archiver repository be backed up?
It depends on your organisation's preferences and practices. Options include:
- Mirror the disk used to store the repository
- Use RAIDed disk for the repository
- Periodically halt the archiver, use standard full or incremental disk backup software
(commercial, or an open source synchronisation tool such as rsync)
and restart the archiver. Note that data to be archived will be
held in a disk-based queue by the distributor and will be stored when the archiver
is restarted.
Depending on demand, a future version of pageVault may contain a
replication facility for automatically maintaining a mirror of the repository on
another system.
- I don't want to have to manage the archive - can you
do it for me?
Yes. Running the archive is not particularly onerous, but you may want
to "outsource" this task to an organisation providing what is effectively
a "notary" service for your web communications.
Please contact us if you'd like to discuss this.
Support, Trial, Licensing
- Who is Project Computing?
Project Computing is a small software house which has been successfully
producing custom and packaged software for
over 20 years.
- Where's the pageVault email group?
The pageVault Discussion
Group is managed and archived by Yahoo! Groups.
- Can I trial pageVault?
Yes, a 30 day trial version is available by completing
the trial application form here.
This version is restricted to archiving web pages "harvested" using the
GNU wget utility, but provides a complete demonstration of pageVault
operating on your live web site data.
- What database software does the archiver use and how much maintenance does it require?
The pageVault archiver uses the open-source JDBM B+Tree index, with performance
and operational enhancements implemented by Project Computing and contributed back to
the JDBM project. B+Tree indices are very fast and extremely scalable, and JDBM
is an excellent B+Tree implementation which requires no ongoing maintenance.
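For the curious, storing and retrieving an index entry with a JDBM B+Tree looks roughly like this sketch, written from memory of the JDBM 1.0 API; treat the class names and signatures as assumptions:

    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;
    import jdbm.btree.BTree;
    import jdbm.helper.StringComparator;

    public class BTreeSketch {
        public static void main(String[] args) throws Exception {
            // Open (or create) a record store backing the B+Tree on disk
            RecordManager recman = RecordManagerFactory.createRecordManager("archiveIndex");
            BTree index = BTree.createInstance(recman, new StringComparator());
            // Store a checksum keyed by URL, replacing any previous value
            index.insert("/index.html", "d41d8cd98f00b204e9800998ecf8427e", true);
            System.out.println(index.find("/index.html"));
            recman.commit();
            recman.close();
        }
    }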
- How do I license pageVault? What does it cost?
You can order a license for pageVault by completing this online
order form or by contacting Project Computing.
One license is required on each physical web server machine running pageVault,
regardless of the number of web servers, web sites serviced or CPUs.
For further information and pricing, see the pageVault
license page.
- What's coming up?
- Access Log correlation - Version 1.4, Feb 2004
All registered users of Version 1.1, 1.2 and 1.3 qualify for a free upgrade to this version.
Further Reading
- Web-archiving: Managing and Archiving Online Documents and Records - Monday 25th March, 2002
- 2nd ECDL Workshop on Web Archiving
- 2nd ECDL Workshop on Web Archiving - report by Michael Day
- Towards Continuous Web Archiving - First Results and an Agenda for the Future - Julien Masanès
- Web Archiving From the Ground Up
- 'Why Do We Need to Keep This in Print? It's on the Web ...': a Review of Electronic Archiving Issues and Problems - Dorothy Warner
- Archiving Web Resources - National Archives of Australia
- Guidelines for Electronic Records Management on State and Federal Agency Websites - Charles R. McClure, J. Timothy Sprehe
- Managing Websites Seminar: Gearing up for the e-commerce era - Australian Society of Archivists Electronic Records Special Interest Group, 1999
- CIO Hotline: The CIO's guide to effective records management - Debra Logan and Mark Gilbert, Gartner
- PageTurner: A large-scale study of the evolution of Web pages - Microsoft Research
- The Fading Memory of the State - David Talbot, Technology Review, July 2005
- Web site archiving - an approach to recording every materially different response produced by a website - Kent Fitch, paper presented at AusWeb03
- Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices - Dollar Consulting, July 20, 2001, for the Smithsonian Institution
- Bibliographies
- Archiving Web Resources - an international conference at the National Library of Australia, 9-11 November 2004