pageVault Reference Manual

Part 2 - Installation


  1. Getting the components

  2. Installing the Archiver

  3. Installing the Apache 2 Filter

    This section describes how to install the pageVault filter for the Apache 2 web server.

    1. Compiling mod_pageVault

      Because of the great number of platforms across which Apache 2 runs and the number of Apache configuration parameters, binary versions of mod_pageVault are not supplied by Project Computing. However, compiling mod_pageVault from source is very simple using the Apache Dynamic Shared Object (DSO) support.

      DSO support relies on mod_so being statically linked into the Apache core. This module then dynamically loads modules as defined by the web administrator in httpd.conf during Apache initialisation.

      Note: pageVault can be statically linked into the Apache core with some minor performance improvements. However, loading using DSO is so much easier to configure that, unless performance is extremely critical, it is recommended that mod_pageVault be loaded via DSO. Contact Project Computing if you wish to statically link mod_pageVault.

      These instructions assume that you have already compiled Apache 2, following the standard Apache 2 installation instructions, and that DSO is enabled. If you are not sure whether DSO is enabled, run the apache executable (in the Apache bin directory) with the "-l" option and check for the presence of the "so" module. If it is not linked into the Apache core, rebuild Apache with the "--enable-so" option on the configure command, eg:

      ./configure --prefix /usr/local/apache2 --enable-so
      make
      make install
      

      (The exact configure parameter string will depend on your local requirements; refer to the Apache documentation for full details.)

      The mod_pageVault module directory can be placed anywhere convenient for you. As a suggestion, copy the contents of the pageVault apache filter source directory to a new directory in your apache2 source tree, eg: /usr/local/src/apache2/modules/pageVault. Then, use the Apache apxs tool to compile the mod_pageVault.c source file. Assuming your apache2 binary directory is /usr/local/apache2/bin, then an appropriate apxs command is:

      /usr/local/apache2/bin/apxs -i -A -c mod_pageVault.c
      

      The -A flag creates a commented out LoadModule directive for mod_pageVault in the httpd.conf file. Refer to the Apache documentation for apxs for more information.

      The result of the apxs command should be to create the mod_pageVault.so file in your apache2 modules directory.

    2. Configuring mod_pageVault

      Configuration of the pageVault filter is performed in the standard Apache httpd.conf file. Assuming that the module is being loaded by DSO, the apxs command described above should have created a commented out LoadModule directive, which you should now uncomment:

      LoadModule pageVault_module   modules/mod_pageVault.so
      

      Refer to the Filter Parameter reference for further details.

  4. Installing the Microsoft IIS Filter

    1. Define the pageVault ISAPI parameters

      A sample set of parameters is defined in the supplied iisFilterParms.txt. The contents of this file are suitable for an initial trial and exploration of pageVault.

      This file can be updated at any time. However, the parameters are only read by the pageVault ISAPI filter when it initialises, and hence the web service must be recycled (stopped and restarted) for changes to take effect.

      The parameter file can be located in any directory readable from the IIS server. The location of the parameter file is defined in a registry key, as described in the next point.

      Take particular care to ensure that the PageVaultDataDirectory and PageVaultControlDirectory parameters are defined to match the corresponding values in the distributor parameters.

      Here is the sample (default) ISAPI filter parameter file:

      ## demo pageVault IIS Filter parameter file
      #
      # This file must be pointed to from the registry key:
      # HKEY_LOCAL_MACHINE\SOFTWARE\Project Computing\pageVault\1.0, value ParmFile
      # The pageVault utility program "setPVParm" can be used to set and inspect this value.
      #
      #
      # This file is read by the IIS pageVault Filter.  The pageVault filter must be installed
      # as a web server (global) level filter, not as a web site filter.
      
      # The name of the file to which pageVault will log initialisation messages and debugging info
      
      PageVaultLogFile C:\PAGEVAULT\FILTERLOG.TXT   
      
      # Enables the pageVault filter.  Values: on or off
      PageVaultEnable on
      
      # The debugging level.  Set to 0 for normal operation.  Values: 0 - 9
      PageVaultDebugLevel 0
      
      # PageVaultBufferSize is the size of the in-memory buffer used to cache responses. 
      # Responses larger than this size must be staged to disk which will increase the
      # overhead of pageVault.  This parameter represents a tradeoff between memory
      # consumption and CPU/IO.  Must be greater than 2000.
      # A value of 60000 is recommended.
      PageVaultBufferSize 60000
      
      # PageVaultHashTableSize is the size of the hash table used to record the checksums
      # of recent responses and hence discard them as duplicates.  The larger the hashtable,
      # the earlier duplicates can be detected and hence the less the overhead.  Another
      # tradeoff between memory and CPU/IO.  Must be between 1000 - 10000. 
      # A value of 2047 is recommended.
      PageVaultHashTableSize 2047
      
      # PageVaultDataDirectory specifies the full path name of the directory
      # that contains the data files produced by the pageVault filter, being
      # the possibly novel HTTP responses.  No trailing slash.
      # ***** The name of this directory must be configured to the pageVault Distributor
      # ***** (see DistributorParms.xml) ************
      PageVaultDataDirectory C:\PAGEVAULT\DATA
      
      # PageVaultControlDirectory specifies the full path name of the directory
      # that contains the control files produced by the pageVault filter, being
      # the possibly novel HTTP responses.  No trailing slash.
      # *****  The name of this directory must be configured to the pageVault Distributor
      # ***** (see DistributorParms.xml) ************
      PageVaultControlDirectory C:\PAGEVAULT\CONTROL
      
      # PageVault Accept and Reject URL and ContentType rules  specifies a list of accept and
      # reject rules that define which responses will be processed and which will be
      # ignored by the filter. 
      # 
      # Accept/reject rules are tested in the order they appear.
      # The first rule to match the URL/contentType being processed is applied.
      # If no rules are supplied, all content is accepted.
      # If no rules have been matched then if the last rule was an Accept, the response
      # will be rejected; if the last rule was a Reject, the response will be accepted.
      
      # Format is: equals|starting|ending|regexp  exactCase|anyCase  url-match string
      ##PageVaultAcceptURL ending anyCase .asp
      ##PageVaultAcceptURL ending anyCase .html
      ##PageVaultAcceptURL ending anyCase .gif
      ##PageVaultAcceptURL ending anyCase .pdf
      ##PageVaultRejectURL ending anyCase .pdf
      ##PageVaultRejectURL ending anyCase .doc
      ##PageVaultAcceptURLAndQuery regexp anyCase .*pany/index.html.*
      ##PageVaultAcceptURLAndQuery regexp anyCase \/cgi-bin\/Search\?.*poison.*
      ##PageVaultAcceptContentType text/html
      ##PageVaultRejectContentType application/
      
      # Content to exclude from hash calculation.
      # These rules are only applied to responses having a content-type of text/html.
      # The urlPattern are matched against NORMALISED urls (ie, lower case).
      #
      #
      
      # define target URL for an exclude-content-from-checksum set definition: setname equals|starting|ending|regexp exactCase|anyCase criteria
      ##PageVaultDefineExcludeContentTargetURL test1 starting anyCase /index.html
      
      # define content to be excluded: setname all|first "start" criteria "end" criteria
      ##PageVaultDefineExcludeContentExpression test1 all start User: end <
      ##PageVaultDefineExcludeContentExpression test1 all start Task: end x<
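
      The layout of the sample file (one "Name value" pair per line, with "#" starting a comment) lends itself to a quick sanity check before the web service is recycled. The Python sketch below is not part of pageVault; the function names are hypothetical and the parsing rules are inferred from the sample above. It checks only the points stated in this manual: both directory parameters must be supplied, must not have a trailing slash, and should not name the same directory.

```python
def parse_filter_parms(text):
    """Collect 'Name value' lines; skip blank lines and '#' comment lines."""
    parms = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # name, then the rest of the line
        value = parts[1] if len(parts) > 1 else ""
        # Some parameters (eg, accept/reject rules) may appear several times.
        parms.setdefault(parts[0], []).append(value)
    return parms


def check_directories(parms):
    """Return a list of problems found with the two directory parameters."""
    problems = []
    names = ("PageVaultDataDirectory", "PageVaultControlDirectory")
    for name in names:
        values = parms.get(name, [])
        if not values:
            problems.append(name + " must be supplied")
        elif values[-1].endswith(("\\", "/")):
            problems.append(name + " must not have a trailing slash")
    data, control = (parms.get(n, [None])[-1] for n in names)
    if data is not None and data == control:
        problems.append("data and control directories should differ")
    return problems


SAMPLE = r"""
# demo parameter file (abbreviated)
PageVaultEnable on
PageVaultDataDirectory C:\PAGEVAULT\DATA
PageVaultControlDirectory C:\PAGEVAULT\CONTROL
##PageVaultAcceptURL ending anyCase .asp
"""

parms = parse_filter_parms(SAMPLE)
print(check_directories(parms) or "directory parameters look consistent")
```

      The same check could be pointed at the values configured for the distributor, since the two sets of directory names must match.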
      
    2. Define the location of the pageVault ISAPI parameter file

      The setPVParm program distributed with pageVault defines the name of the parameter file which is read at initialisation time by the pageVault ISAPI filter.

      You should run the setPVParm program before defining the pageVault ISAPI filter to IIS.

      It should be run with one parameter: the full path and file name of the pageVault ISAPI filter parameter file.

      Eg:

      	C:\pageVault>setPVparm c:\pageVault\deploy\parms\iisFilterParms.txt
      

      The program should echo a short description:

      	PageVault setPVParm - set or display the registry key containing the location
      	of the pageVault Filter parameter file
      

      then the name of the registry key being set:

      	Registry key: SOFTWARE\Project Computing\pageVault\1.0
      

      then the value of the key (the name of the parameter file):

      	Setting key to new value: c:\pageVault\deploy\parms\iisFilterParms.txt
      

      then a "success" message:

      	Key successfully set
      

      and finally report that the filename set does exist and can be read:

      	Parameter file exists and can be opened for reading
      

      You can run setPVparm at any time to show the value of the registry key. Of course, you can also edit the registry directly; this program is merely a tool of convenience.

    3. Define the pageVault ISAPI filter to IIS

      The supplied PageVault.dll program must be defined to the IIS web service as a global filter (not a web site filter). That is, the PageVault ISAPI is installed as a service level rather than site level filter.

      Use the Microsoft IIS management/administration tool to open the "web sites" properties and select the ISAPI filters tab. Click "add" and then "browse" to select the PageVault.dll and then click OK. The PageVault.dll file can be placed in any directory.

      Restart the IIS service and the pageVault filter should be operational.

      The Microsoft Event viewer should show initialisation messages from pageVault. Also, the log file (defined in the pageVault ISAPI parameter file) should record the successful initialisation of the filter.

      As requests are handled by the web sites, the pageVault ISAPI filter will start writing responses to the data and control directories (defined in the pageVault ISAPI parameter file).

  5. Installing the GNU wget Filter

  6. Installing the Distributor

    Distributing to multiple archives

    Some sites may find it convenient to configure a single distributor to send content to different archives based on the URL of the request.

    This can be achieved by defining multiple archiverQueue elements within the archiverQueues element, each with their own list of acceptPattern elements which define a regular expression to match against the url being archived. For example:

    <archiverQueues>
      <archiverQueue hostName="pv-archiver1.sample.com" hostIPAddress="10.10.18.56" port="8071" connectTo="true">
        <acceptPattern>^www.sample.com/.*</acceptPattern>
        <acceptPattern>^sample.com/.*</acceptPattern>
      </archiverQueue>
      <archiverQueue hostName="pv-archiver1.sample.com" hostIPAddress="10.10.18.56" port="8171" connectTo="true">
        <acceptPattern>^www-test.sample.com/public/.*</acceptPattern>
      </archiverQueue>
      <archiverQueue hostName="pv-archiver2.sample.com" hostIPAddress="10.10.18.57" port="8071" connectTo="true">
        <acceptPattern>.*/intranet/.*</acceptPattern>
      </archiverQueue>
    </archiverQueues>

    In this example:

    • the URL is first tested to see if it starts with www.sample.com/ or sample.com/ and if so is directed to the archiver running on pv-archiver1.sample.com:8071
    • otherwise, the URL is tested to see if it starts with www-test.sample.com/public and if so is directed to the archiver running on pv-archiver1.sample.com:8171
    • otherwise, the URL is tested to see if it contains the string /intranet/ and if so is directed to the archiver running on pv-archiver2.sample.com:8071
    • if no regular expression matches the URL then it is sent to the first defined archiverQueue

    Notes:

    1. The tests are applied in the order in which the archiverQueue elements appear within the archiverQueues element in the parameter file

    2. The tested URL includes the hostname (and :portnumber if the portnumber is not 80)

    3. The regular expression is always applied in "case insensitive" mode. That is, WWW.SAMPLE.COM will be matched by www.sample.com

    4. If no regular expression matches the URL then it is sent to the first defined archiverQueue

    5. Some versions of the sample distributorParms.xml file erroneously imply that the hostname is not part of the URL matched by the regular expression in the acceptPattern element.
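
    The selection logic described above can be sketched in Python. This is an illustration only, not the distributor's code: the function name and the in-memory list are hypothetical, and the use of an unanchored search (re.search) is an assumption, since this manual does not say how the distributor applies the expression.

```python
import re

# (hostName, port, acceptPatterns), in the order the archiverQueue
# elements appear in the parameter file.
ARCHIVER_QUEUES = [
    ("pv-archiver1.sample.com", 8071, [r"^www.sample.com/.*", r"^sample.com/.*"]),
    ("pv-archiver1.sample.com", 8171, [r"^www-test.sample.com/public/.*"]),
    ("pv-archiver2.sample.com", 8071, [r".*/intranet/.*"]),
]


def choose_queue(url):
    """Return (host, port) of the first queue with a matching acceptPattern.

    The url includes the hostname (and :port if the port is not 80).
    Matching is always case insensitive.
    """
    for host, port, patterns in ARCHIVER_QUEUES:
        if any(re.search(p, url, re.IGNORECASE) for p in patterns):
            return host, port
    # No pattern matched: fall back to the first defined queue.
    host, port, _ = ARCHIVER_QUEUES[0]
    return host, port


for url in ("WWW.SAMPLE.COM/index.html",
            "www-test.sample.com/public/news.html",
            "hr.sample.com/intranet/policy.html",
            "other.example.org/page.html"):
    print(url, "->", choose_queue(url))
```

    Because the tests are applied in order, a URL such as www.sample.com/intranet/x is claimed by the first queue, before the /intranet/ pattern is ever tried.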

  7. Installing the Viewer

    Supporting multiple archives

    Large sites may find it convenient to store content across multiple archives, eg, one for public web sites, one for a transaction based web site, another for the intranet sites.

    Although it is simple to create one viewer per archive, it may often be more convenient to have all archives accessible from a single viewer, as described here:

    1. Create a DNS alias for the machine on which the viewer is running for each pageVault archive you wish to access. For example, "pv-extranet.sample.com", "pv-intranet.sample.com", "pv-public.sample.com". Each name is bound to the same physical machine/IP address - that of the machine running the pageVault viewer.

    2. Edit Tomcat's conf/server.xml file to define each of these names as aliases of the main host entry. Eg:

      <Host name="localhost" debug="0" appBase="c:/pageVault/deploy" unpackWARs="false" autoDeploy="false" liveDeploy="false">
        <Alias>pv-extranet.sample.com</Alias>
        <Alias>pv-intranet.sample.com</Alias>
        <Alias>pv-public.sample.com</Alias>
        ...

    3. Edit the pageVault viewerParms configuration file to define the archiverQueryListener element for each archive to which the viewer should connect. For example, assuming that:

      • the extranet archive is running on machine pv-archiver1, listening for viewer connections on port 8073
      • the intranet archive is running on the same machine, listening for viewer connections on port 8173
      • the public web pages archive is running on machine pv-archiver2, listening for viewer connections on port 8073
      then the following definitions would be appropriate:

      <archiverQueryListener hostIPAddress="pv-archiver1.sample.com" port="8073">
        <bindToVirtualHost>pv-extranet.sample.com</bindToVirtualHost>
      </archiverQueryListener>
      <archiverQueryListener hostIPAddress="pv-archiver1.sample.com" port="8173">
        <bindToVirtualHost>pv-intranet.sample.com</bindToVirtualHost>
      </archiverQueryListener>
      <archiverQueryListener hostIPAddress="pv-archiver2.sample.com" port="8073">
        <bindToVirtualHost>pv-public.sample.com</bindToVirtualHost>
      </archiverQueryListener>

    4. Then, assuming the Tomcat running the Viewer is listening for HTTP connections on port 8080,

      • to access the extranet archive, you'd browse to http://pv-extranet.sample.com:8080/pv
      • to access the intranet archive, you'd browse to http://pv-intranet.sample.com:8080/pv
      • to access the public archive, you'd browse to http://pv-public.sample.com:8080/pv

  8. Testing the installation

  9. Filter parameter reference

    PageVaultEnable
    Values: on, off
    Default: on
    Specifies whether pageVault should be enabled or disabled at the server or virtual server level.

    PageVaultDebugLevel
    Values: integer between 0 and 9
    Default: 0
    Specifies the debug message level. Set to 0 for no debug messages, 9 for maximum debugging. Set to 0 for minimum overhead.

    PageVaultBufferSize
    Values: integer at least 2000
    Default: 49152
    Specifies the size of the per-request memory buffer used to accumulate the response data. When the response exceeds this size, pageVault must retain the response in a temporary file, because it is only at the end of the response that the uniqueness of this response can be assessed. Hence, setting a small value of this parameter (smaller than frequently encountered response sizes) increases the number of temporary files created/written/deleted by the pageVault filter. Setting a large value increases per-response memory requirements but reduces temporary file I/O.

    PageVaultHashTableSize
    Values: integer between 1000 and 10000
    Default: 511
    Specifies the size of the hash table the filter uses for early detection of duplicate responses. Each hash table entry consumes 20 bytes of memory. For Apache users, one copy of the hashtable is allocated per child process. Threads within a process share the same hashtable.

    The pageVault filter attempts to detect and ignore duplicate request/response pairs as early as possible by maintaining a hashtable of already seen responses. Only if the current response is not in the hashtable will it be handed to the pageVault distributor for further processing. The hashtable is just the "first line" in duplicate detection - the distributor and archiver perform increasingly expensive duplicate detection.

    This parameter will be automatically adjusted to 1 less than a power of 2 not greater than the supplied parameter (to provide efficient hash key distribution over the table). Hence, if a parameter value of "2000" were supplied, it would be adjusted to 1023, and "2048" would be adjusted to "2047".
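
    The adjustment rule can be checked with a few lines of Python. This is a sketch of the arithmetic only (the largest value of the form 2^k - 1 that does not exceed the supplied parameter); the function name is hypothetical and the filter's own implementation may differ.

```python
def adjusted_hash_table_size(requested):
    """Largest value of the form 2**k - 1 that is not greater than requested."""
    if requested < 1:
        raise ValueError("requested size must be at least 1")
    # (requested + 1).bit_length() - 1 is floor(log2(requested + 1)).
    exponent = (requested + 1).bit_length() - 1
    return (1 << exponent) - 1


BYTES_PER_ENTRY = 20  # per-entry memory cost stated in this section

for requested in (2000, 2047, 2048):
    size = adjusted_hash_table_size(requested)
    print(requested, "is adjusted to", size,
          "entries, about", size * BYTES_PER_ENTRY, "bytes per hashtable")
```

    For Apache, the memory figure applies to each child process, since one copy of the hashtable is allocated per child.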

    PageVaultDataDirectory
    PageVaultControlDirectory
    Values: pathname
    Default: none - must be supplied
    Specifies the directories which will contain the data and control files written by the pageVault filter. These files will be read and deleted by the pageVault distributor, and so the value of each parameter must match that set in the pageVault distributor parameter definitions. The contents of each directory must be writable by the pageVault filter and the pageVault distributor. The filter will create the path to each directory if it does not already exist. The two paths should not be the same because the distributor works most efficiently when data and control files are in separate directories. However, it would be typical for both parameters to share a common parent path, eg:
    PageVaultDataDirectory /usr/local/pageVault/filter/data
    PageVaultControlDirectory /usr/local/pageVault/filter/control
    
    PageVaultAcceptContentType
    PageVaultRejectContentType
    Values: complete or partial mime type
    Default: none
    Specifies a mime type of a response to be accepted or rejected for further processing. The pageVault filter attempts to perform a match on the mime type of the response and values supplied for each occurrence of this parameter. Only the number of characters provided in the parameter value are compared, allowing for (trailing) wildcard matching. For example, given these values:
    PageVaultAcceptContentType text/html
    PageVaultRejectContentType image/tif
    PageVaultAcceptContentType image/
    PageVaultAcceptContentType application/ms-word
    PageVaultRejectContentType application/
    

    then the content-type of the response will be compared against each rule in turn in the order in which they appear. Other Accept/Reject rules (URL and URLAndQuery) may be interspersed and will be processed as one sequence, in the order in which they appear.

    So, the first comparison in this example will be between the content type of the response and the string "text/html". If the first 9 characters of the response's content type match, the response will be accepted, and no further Accept/Reject rules will be applied. If the content-type does not start with this string, the first 9 characters will then be compared with "image/tif", and if a match occurs, the response will be rejected and no further Accept/Reject rules will be applied. Otherwise, the comparisons will continue until the first Accept or Reject is matched.

    If no Accept or Reject rules are matched, then if the final rule tested was a Reject, the response will be accepted, whereas if the final rule tested was an Accept, the response will be rejected.

    If no Accept or Reject parameters of any type are supplied then all responses are automatically accepted.
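
    The evaluation order and the fall-through behaviour can be illustrated with a short Python sketch. It is not the filter's code: the names are hypothetical, and comparing the content type case-sensitively is an assumption, as this manual does not specify case handling for mime types.

```python
# Rules in the order they appear in the example above.
CONTENT_TYPE_RULES = [
    ("accept", "text/html"),
    ("reject", "image/tif"),
    ("accept", "image/"),
    ("accept", "application/ms-word"),
    ("reject", "application/"),
]


def decide_content_type(content_type, rules=CONTENT_TYPE_RULES):
    """Return 'accept' or 'reject' for a response's content type."""
    if not rules:
        return "accept"  # no rules supplied: everything is accepted
    for action, prefix in rules:
        # Only len(prefix) characters are compared (trailing wildcard).
        if content_type.startswith(prefix):
            return action  # the first matching rule wins
    # No rule matched: the outcome is the opposite of the last rule tested.
    return "accept" if rules[-1][0] == "reject" else "reject"


for ct in ("text/html; charset=utf-8", "image/tiff", "image/png",
           "application/pdf", "video/mpeg"):
    print(ct, "->", decide_content_type(ct))
```

    Note that "video/mpeg" matches no rule and is accepted only because the last rule in the list is a Reject; appending an Accept rule would reverse that default.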

    PageVaultAcceptURL
    PageVaultRejectURL
    PageVaultAcceptURLAndQuery
    PageVaultRejectURLAndQuery
    Values: equals|starting|ending|regexp anyCase|exactCase urlPattern
    Default: none
    Specifies a rule for matching the url (or, if the AndQuery form is used, url and query string) of the response to determine whether the response should be accepted or rejected for further processing.

    [See the PageVaultAcceptContentType parameter description for general information on the ordering of Accept/Reject rules.]

    Each rule must contain three parameter values:

    1. the type of matching, one of:
      • equals: the url of the response must equal the urlPattern in its entirety
      • starting: the url of the response must start with the characters in the urlPattern
      • ending: the url of the response must end with the characters in the urlPattern
      • regexp: the url must match the regular expression provided in the urlPattern
    2. whether string matching is case specific or non-specific (exactCase/anyCase respectively)
    3. the urlPattern; for regexp matching, this must be a regular expression

    For example:

    PageVaultAcceptURL ending anyCase .html
    PageVaultAcceptURL ending anyCase .gif
    PageVaultRejectURL ending anyCase .pdf
    PageVaultRejectURLAndQuery regexp exactCase /cgi-bin/searchResults.*corporate.*
    PageVaultAcceptURL starting exactCase /cgi-bin/
    

    In this example, the URL of the response will first be checked to see if it ends in ".html" (or ".HtMl", etc), and if so, the request will be accepted. Otherwise, it will be checked to see whether it ends in ".gif" (and case variants), and if so, will be accepted. Otherwise, it will be checked to see whether it ends in ".pdf" (and case variants), and if so, will be rejected. Otherwise, the URL and Query strings will be checked against the regular expression "/cgi-bin/searchResults.*corporate.*", and if it matches, it will be rejected. Otherwise, the first 9 characters of the request URL will be checked against the string "/cgi-bin/", and if a match occurs the request will be accepted. Otherwise, finally, the request will be rejected.

    The "URLAndQuery" versions match against the base URL followed by a "?" and the data supplied, whether by a GET or a POST, eg:

    /cgi-bin/getQuote?id=PYT&period=June
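
    The four match types and the URLAndQuery form can be sketched in Python, using the example rules from this section. The names are hypothetical and two points are assumptions rather than documented behaviour: regular expressions are applied as an unanchored search, and when no query data is supplied the URLAndQuery target is taken to be the URL alone.

```python
import re


def url_matches(kind, case, pattern, target):
    """Return True if target satisfies one rule's match type and case setting."""
    if kind == "regexp":
        flags = re.IGNORECASE if case == "anyCase" else 0
        # Assumption: the expression may match anywhere in the target.
        return re.search(pattern, target, flags) is not None
    if case == "anyCase":
        pattern, target = pattern.lower(), target.lower()
    if kind == "equals":
        return target == pattern
    if kind == "starting":
        return target.startswith(pattern)
    if kind == "ending":
        return target.endswith(pattern)
    raise ValueError("unknown match type: " + kind)


# (action, uses query, match type, case, urlPattern), in configuration order.
URL_RULES = [
    ("accept", False, "ending", "anyCase", ".html"),
    ("accept", False, "ending", "anyCase", ".gif"),
    ("reject", False, "ending", "anyCase", ".pdf"),
    ("reject", True, "regexp", "exactCase", r"/cgi-bin/searchResults.*corporate.*"),
    ("accept", False, "starting", "exactCase", "/cgi-bin/"),
]


def decide_url(url, query="", rules=URL_RULES):
    """Return 'accept' or 'reject' using first-match-wins rule ordering."""
    if not rules:
        return "accept"
    for action, uses_query, kind, case, pattern in rules:
        # Assumption: with no query data, the target is just the URL.
        target = url + "?" + query if uses_query and query else url
        if url_matches(kind, case, pattern, target):
            return action
    # No rule matched: the outcome is the opposite of the last rule tested.
    return "accept" if rules[-1][0] == "reject" else "reject"


print(decide_url("/cgi-bin/getQuote", "id=PYT&period=June"))
print(decide_url("/cgi-bin/searchResults", "q=corporate+plan"))
```

    Placing the cheap "ending" rules ahead of the regexp rule, as in the example, means most requests are decided before the slower regular expression is evaluated.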
    

    Performance notes

    • exactCase is faster than anyCase
    • regexp matches are the slowest and should only be used when necessary. "Unanchored" regexp (those that don't have fixed beginning or ending strings, such as ".*search.*") are slowest of all.
    • The "URL" versions of these rules are faster than their "URLAndQuery" equivalents

 
Project Computing Pty Ltd    ACN: 008 590 967 contact@projectComputing.com