pageVault Reference Manual
Part 2 - Installation
This section describes how to install the pageVault filter for the Apache 2 web server.
Because of the great number of platforms across which Apache 2 runs and the number of Apache configuration parameters, binary versions of mod_pageVault are not supplied by Project Computing. However, compiling mod_pageVault from source is very simple using the Apache Dynamic Shared Object (DSO) support.
DSO support relies on mod_so being statically linked into the Apache core. This module then dynamically loads modules as defined by the web administrator in httpd.conf during Apache initialisation.
Note: pageVault can be statically linked into the Apache core with some minor performance improvements. However, loading via DSO is so much easier to configure that, unless performance is extremely critical, it is recommended that mod_pageVault be loaded via DSO. Contact Project Computing if you wish to statically link mod_pageVault.
These instructions assume that you have already compiled Apache 2, following the standard Apache 2 installation instructions, and that DSO is enabled. If you are not sure whether DSO is enabled, run the apache executable (in the Apache bin directory) with the "-l" option and check for the presence of the "so" module. If it is not linked into the Apache core, rebuild Apache with the "--enable-so" option on the configure command, eg:
    ./configure --prefix /usr/local/apache2 --enable-so
    make
    make install
(The exact configure parameter string will depend on your local requirements; refer to the Apache documentation for full details.)
The mod_pageVault module directory can be placed anywhere convenient for you. As a suggestion, copy the contents of the pageVault apache filter source directory to a new directory in your apache2 source tree, eg: /usr/local/src/apache2/modules/pageVault. Then, use the Apache apxs tool to compile the mod_pageVault.c source file. Assuming your apache2 binary directory is /usr/local/apache2/bin, then an appropriate apxs command is:

    /usr/local/apache2/bin/apxs -i -A -c mod_pageVault.c
The -A flag creates a commented out LoadModule directive for mod_pageVault in the httpd.conf file. Refer to the Apache documentation for apxs for more information.
The result of the apxs command should be to create the mod_pageVault.so file in your apache2 modules directory.
Configuration of the pageVault filter is performed in the standard Apache httpd.conf file. Assuming that the module is being loaded by DSO, the apxs command described above should have created a commented out LoadModule directive, which you should now uncomment:
LoadModule pageVault_module modules/mod_pageVault.so
Refer to the Filter Parameter reference for further details.
A sample set of parameters is defined in the supplied iisFilterParms.txt. The contents of this file are suitable for an initial trial and exploration of pageVault.
This file can be updated at any time. However, the parameters are only read by the pageVault ISAPI filter when it initialises, and hence the web service must be recycled (stopped and restarted) for changes to take effect.
The parameter file can be located in any directory readable from the IIS server. The location of the parameter file is defined in a registry key, as described in the next point.
Take particular care to ensure that the PageVaultDataDirectory and PageVaultControlDirectory parameters are defined to match the corresponding values in the distributor parameters.
Here is the sample (default) ISAPI filter parameter file:
    ## demo pageVault IIS Filter parameter file
    #
    # This file must be pointed to from the registry key:
    # HKEY_LOCAL_MACHINE\SOFTWARE\Project Computing\pageVault\1.0, value ParmFile
    # The pageVault utility program "setPVParm" can be used to set and inspect this value.
    #
    #
    # This file is read by the IIS pageVault Filter. The pageVault filter must be installed
    # as a web server (global) level filter, not as a web site filter.

    # The name of the file to which pageVault will log initialisation messages and debugging info
    PageVaultLogFile C:\PAGEVAULT\FILTERLOG.TXT

    # Enables the pageVault filter. Values: on or off
    PageVaultEnable on

    # The debugging level. Set to 0 for normal operation. Values: 0 - 9
    PageVaultDebugLevel 0

    # PageVaultBufferSize is the size of the in-memory buffer used to cache responses.
    # Responses larger than this size must be staged to disk which will increase the
    # overhead of pageVault. This parameter represents a tradeoff between memory
    # consumption and CPU/IO. Must be greater than 2000.
    # A value of 60000 is recommended.
    PageVaultBufferSize 60000

    # PageVaultHashTableSize is the size of the hash table used to record the checksums
    # of recent responses and hence discard them as duplicates. The larger the hashtable,
    # the earlier duplicates can be detected and hence the less the overhead. Another
    # tradeoff between memory and CPU/IO. Must be between 1000 - 10000.
    # A value of 2047 is recommended.
    PageVaultHashTableSize 2047

    # PageVaultDataDirectory specifies the full path name of the directory
    # that contains the data files produced by the pageVault filter, being
    # the possibly novel HTTP responses. No trailing slash.
    # ***** The name of this directory must be configured to the pageVault Distributor
    # ***** (see DistributorParms.xml) ************
    PageVaultDataDirectory C:\PAGEVAULT\DATA

    # PageVaultControlDirectory specifies the full path name of the directory
    # that contains the control files produced by the pageVault filter, being
    # the possibly novel HTTP responses. No trailing slash.
    # ***** The name of this directory must be configured to the pageVault Distributor
    # ***** (see DistributorParms.xml) ************
    PageVaultControlDirectory C:\PAGEVAULT\CONTROL

    # PageVault Accept and Reject URL and ContentType rules specifies a list of accept and
    # reject rules that define which responses will be processed and which will be
    # ignored by the filter.
    #
    # Accept/reject rules are tested in the order they appear.
    # The first rule to match the URL/contentType being processed is applied.
    # If no rules are supplied, all content is accepted.
    # If no rules have been matched then if the last rule was an Accept, the response
    # will be rejected; if the last rule was a Reject, the response will be accepted.
    # Format is: equals|starting|ending|regexp exactCase|anyCase url-match string
    ##PageVaultAcceptURL ending anyCase .asp
    ##PageVaultAcceptURL ending anyCase .html
    ##PageVaultAcceptURL ending anyCase .gif
    ##PageVaultAcceptURL ending anyCase .pdf
    ##PageVaultRejectURL ending anyCase .pdf
    ##PageVaultRejectURL ending anyCase .doc
    ##PageVaultAcceptURLAndQuery regexp anyCase .*pany/index.html.*
    ##PageVaultAcceptURLAndQuery regexp anyCase \/cgi-bin\/Search\?.*poison.*
    ##PageVaultAcceptContentType text/html
    ##PageVaultRejectContentType application/

    # Content to exclude from hash calculation.
    # These rules are only applied to responses having a content-type of text/html.
    # The urlPattern are matched against NORMALISED urls (ie, lower case).
    #
    #
    # define target URL for an exclude-content-from-checksum set definition: setname equals|starting|ending|regexp exactCase|anyCase criteria
    ##PageVaultDefineExcludeContentTargetURL test1 starting anyCase /index.html
    # define content to be excluded: setname all|first "start" criteria "end" criteria
    ##PageVaultDefineExcludeContentExpression test1 all start User: end <
    ##PageVaultDefineExcludeContentExpression test1 all start Task: end x<
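To make the layout of the sample file concrete, here is a minimal Python sketch that reads such a file into an ordered list of name/value pairs. It is not the filter's own parser. It assumes only what the sample shows: lines beginning with "#" are comments, and each remaining line is a parameter name followed by whitespace and a value. The function name parse_filter_parms and the SAMPLE text are hypothetical.

```python
from typing import List, Tuple


def parse_filter_parms(text: str) -> List[Tuple[str, str]]:
    """Return (name, value) pairs in file order, skipping comments and blanks.

    Assumption: a parameter line is "<Name> <value...>"; anything starting
    with '#' (including the '##' commented-out samples) is ignored.
    """
    parms = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        parms.append((name, value))
    return parms


# A short, hypothetical excerpt in the same style as the sample file.
SAMPLE = """\
## demo pageVault IIS Filter parameter file
# Enables the pageVault filter. Values: on or off
PageVaultEnable on
PageVaultBufferSize 60000
##PageVaultAcceptURL ending anyCase .asp
PageVaultDataDirectory C:\\PAGEVAULT\\DATA
"""

if __name__ == "__main__":
    for name, value in parse_filter_parms(SAMPLE):
        print(name, "=", value)
```

A list is used rather than a dictionary because the Accept/Reject parameters may be repeated and their order matters.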
The setPVParm program distributed with pageVault defines the name of the parameter file which is read at initialisation time by the pageVault ISAPI filter.
You should run the setPVParm program before defining the pageVault ISAPI filter to IIS.
It should be run with one parameter: the full path and file name of the pageVault ISAPI filter parameter file.
Eg:
C:\pageVault>setPVparm c:\pageVault\deploy\parms\iisFilterParms.txt
The program should echo a short description:
PageVault setPVParm - set or display the registry key containing the location of the pageVault Filter parameter file
then the name of the registry key being set:
Registry key: SOFTWARE\Project Computing\pageVault\1.0
then the value of the key (the name of the parameter file):
Setting key to new value: c:\pageVault\deploy\parms\iisFilterParms.txt
then a "success" message:
Key successfully set
and finally report that the filename set does exist and can be read:
Parameter file exists and can be opened for reading
You can run setPVparm at any time to show the value of the registry key. Of course, you can also edit the registry directly; this program is merely a tool of convenience.
The supplied PageVault.dll program must be defined to the IIS web service as a global filter (not a web site filter). That is, the PageVault ISAPI is installed as a service level rather than site level filter.
Use the Microsoft IIS management console to open the "web sites" properties and select the ISAPI filters tab. Click "add" and then "browse" to select the PageVault.dll, and then click OK. The PageVault.dll file can be placed in any directory.
Restart the IIS service and the pageVault filter should be operational.
The Microsoft Event viewer should show initialisation messages from pageVault. Also, the log file (defined in the pageVault ISAPI parameter file) should record the successful initialisation of the filter.
As requests are handled by the web sites, the pageVault ISAPI filter will start writing responses to the data and control directories (defined in the pageVault ISAPI parameter file).
Distributing to multiple archives
Some sites may find it convenient to configure a single distributor to send content to different archives based on the URL of the request.
This can be achieved by defining multiple archiverQueue elements within the archiverQueues element, each with their own list of acceptPattern elements which define a regular expression to match against the url being archived. For example:
In this example:
Notes:
The tests are applied in the order in which the archiverQueue elements appear within the archiverQueues element in the parameter file
The tested URL includes the hostname (and :portnumber if the portnumber is not 80)
The regular expression is always applied in "case insensitive" mode. That is, WWW.SAMPLE.COM will be matched by www.sample.com
If no regular expression matches the URL then it is sent to the first defined archiverQueue
Some versions of the sample distributorParms.xml file erroneously imply that the hostname is not part of the URL matched by the regular expression in the acceptPattern element.
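The selection rules in the notes above can be sketched in Python as follows. This is an illustration only, not the distributor's implementation. The queue names, the patterns and the choose_queue function are hypothetical. The sketch assumes that the URL is tested without a scheme prefix and that a pattern may match anywhere in the URL, neither of which is stated here.

```python
import re
from typing import Sequence, Tuple

# Hypothetical queue definitions in parameter-file order:
# (queue name, list of acceptPattern regular expressions).
QUEUES = [
    ("intranet", [r"intranet\.sample\.com:8080/.*"]),
    ("public", [r"www\.sample\.com/.*"]),
]


def choose_queue(url: str, queues: Sequence[Tuple[str, Sequence[str]]]) -> str:
    """Pick the archiver queue for a URL.

    url includes the hostname, plus ":port" when the port is not 80.
    Queues are tested in order; matching is case insensitive; if nothing
    matches, the first defined queue is used.
    """
    for name, patterns in queues:
        for pattern in patterns:
            # Assumption: an unanchored search, since anchoring is not documented.
            if re.search(pattern, url, re.IGNORECASE):
                return name
    return queues[0][0]


if __name__ == "__main__":
    print(choose_queue("WWW.SAMPLE.COM/index.html", QUEUES))
    print(choose_queue("intranet.sample.com:8080/home", QUEUES))
    print(choose_queue("other.example.org/page", QUEUES))
```

Because an unmatched URL falls through to the first queue, the order of the archiverQueue elements decides both precedence and the default destination.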
Large sites may find it convenient to store content across multiple archives, eg, one for public web sites, one for a transaction based web site, another for the intranet sites.
Although it is simple to create one viewer per archive, it may often be more convenient to have all archives accessible from a single viewer, as described here:
Create a DNS alias for the machine on which the viewer is running for each pageVault archive you wish to access. For example, "pv-extranet.sample.com", "pv-intranet.sample.com", "pv-public.sample.com". Each name is bound to the same physical machine/IP address - that of the machine running the pageVault viewer.
Edit Tomcat's conf/server.xml file to define each of these names as aliases of the main host entry. Eg:
Edit the pageVault viewerParms configuration file to define the archiverQueryListener element for each archive to which the viewer should connect. For example, assuming that:
Then, assuming the Tomcat running the Viewer is listening for HTTP connections on port 8080,
The pageVault filter attempts to detect and ignore duplicate request/response pairs as early as possible by maintaining a hashtable of already seen responses. Only if the current response is not in the hashtable will it be handed to the pageVault distributor for further processing. The hashtable is just the "first line" in duplicate detection - the distributor and archiver perform increasingly expensive duplicate detection.
This parameter will be automatically adjusted down to the largest value that is 1 less than a power of 2 and not greater than the supplied value (to provide efficient hash key distribution over the table). Hence, if a parameter value of "2000" were supplied, it would be adjusted to 1023, and "2048" would be adjusted to 2047.
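The adjustment can be illustrated with a few lines of Python. This sketch reproduces the arithmetic described above and is not taken from the filter source. The function name adjusted_hash_table_size is hypothetical.

```python
def adjusted_hash_table_size(requested: int) -> int:
    """Largest value of the form 2**n - 1 that is not greater than requested."""
    if requested < 1:
        raise ValueError("requested size must be at least 1")
    size = 1
    # Step through 1, 3, 7, 15, ... while the next value still fits.
    while size * 2 + 1 <= requested:
        size = size * 2 + 1
    return size


if __name__ == "__main__":
    for value in (2000, 2047, 2048):
        print(value, "->", adjusted_hash_table_size(value))
```

For the documented examples this gives 1023 for 2000 and 2047 for 2048. A value that already has the form 2**n - 1, such as 2047, is left unchanged.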
    PageVaultDataDirectory /usr/local/pageVault/filter/data
    PageVaultControlDirectory /usr/local/pageVault/filter/control
    PageVaultAcceptContentType text/html
    PageVaultRejectContentType image/tif
    PageVaultAcceptContentType image/
    PageVaultAcceptContentType application/ms-word
    PageVaultRejectContentType application/
then the content-type of the response will be compared against each rule in turn, in the order in which they appear. Other Accept/Reject rules (URL and URLAndQuery) may be interspersed and will be processed as one sequence, in the order in which they appear.
So, the first comparison in this example will be between the content type of the response and the string "text/html". If the first 9 characters of the response's content type match, the response will be accepted, and no further Accept/Reject rules will be applied. If the content-type does not start with this string, the first 9 characters will then be compared with "image/tif", and if a match occurs, the response will be rejected and no further Accept/Reject rules will be applied. Otherwise, the comparisons will continue until the first Accept or Reject is matched.
If no Accept or Reject rules are matched, then if the final rule tested was a Reject, the response will be accepted, whereas if the final rule tested was an Accept, the response will be rejected.
If no Accept or Reject parameters of any type are supplied then all responses are automatically accepted.
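The ordering and default behaviour described above can be sketched in Python as follows. This is a simplified model covering content-type rules only, not the filter's code. The RULES list mirrors the example rules, while the is_accepted function and the case-sensitive prefix comparison are assumptions.

```python
from typing import Sequence, Tuple

# The example rules, in the order they appear: (action, content-type prefix).
RULES = [
    ("accept", "text/html"),
    ("reject", "image/tif"),
    ("accept", "image/"),
    ("accept", "application/ms-word"),
    ("reject", "application/"),
]


def is_accepted(content_type: str, rules: Sequence[Tuple[str, str]]) -> bool:
    """Apply Accept/Reject rules in order; the first matching rule decides."""
    if not rules:
        return True  # no rules supplied: everything is accepted
    for action, prefix in rules:
        # Assumption: a plain, case-sensitive "starts with" comparison.
        if content_type.startswith(prefix):
            return action == "accept"
    # Nothing matched: the outcome is the opposite of the final rule.
    return rules[-1][0] == "reject"


if __name__ == "__main__":
    for ct in ("text/html; charset=utf-8", "image/tiff", "image/gif",
               "application/pdf", "video/mpeg"):
        print(ct, "->", "accepted" if is_accepted(ct, RULES) else "rejected")
```

With these rules a "video/mpeg" response is accepted, because nothing matches and the final rule is a Reject. Adding a trailing Accept rule would reverse that default.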
[See the PageVaultAcceptContentType parameter description for general information on the ordering of Accept/Reject rules.]
Each rule must contain three parameter values:
For example:
    PageVaultAcceptURL ending anyCase .html
    PageVaultAcceptURL ending anyCase .gif
    PageVaultRejectURL ending anyCase .pdf
    PageVaultRejectURLAndQuery regexp exactCase /cgi-bin/searchResults.*corporate.*
    PageVaultAcceptURL starting exactCase /cgi-bin/
In this example, the URL of the response will first be checked to see if it ends in ".html" (or ".HtMl", etc), and if so, the request will be accepted. Otherwise, it will be checked to see whether it ends in ".gif" (and case variants), and if so, will be accepted. Otherwise, it will be checked to see whether it ends in ".pdf" (and case variants), and if so, will be rejected. Otherwise, the URL and Query strings will be checked against the regular expression "/cgi-bin/searchResults.*corporate.*", and if they match, the request will be rejected. Otherwise, the first 9 characters of the request URL will be checked against the string "/cgi-bin/", and if a match occurs the request will be accepted. Otherwise, finally, the request will be rejected, because the final rule tested was an Accept.
The "URLAndQuery" versions match against the base URL followed by a "?" and the data supplied, whether by a GET or a POST, eg:
/cgi-bin/getQuote?id=PYT&period=June
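The match types and case modes used in the URL rules can be sketched in Python as below. This illustrates the rule syntax only and is not the filter's matching code. The url_rule_matches function is hypothetical, and treating "regexp" as a match against the whole string is an assumption suggested by the leading and trailing ".*" in the sample patterns.

```python
import re


def url_rule_matches(match_type: str, case_mode: str, criteria: str, url: str) -> bool:
    """Test one rule of the form: equals|starting|ending|regexp exactCase|anyCase criteria.

    For the URLAndQuery variants, pass the base URL followed by "?" and the
    query data, eg "/cgi-bin/getQuote?id=PYT&period=June".
    """
    if case_mode not in ("exactCase", "anyCase"):
        raise ValueError("unknown case mode: " + case_mode)
    if match_type == "regexp":
        flags = re.IGNORECASE if case_mode == "anyCase" else 0
        # Assumption: the expression must match the whole string.
        return re.fullmatch(criteria, url, flags) is not None
    if case_mode == "anyCase":
        criteria, url = criteria.lower(), url.lower()
    if match_type == "equals":
        return url == criteria
    if match_type == "starting":
        return url.startswith(criteria)
    if match_type == "ending":
        return url.endswith(criteria)
    raise ValueError("unknown match type: " + match_type)


if __name__ == "__main__":
    print(url_rule_matches("ending", "anyCase", ".html", "/Index.HtMl"))
    print(url_rule_matches("starting", "exactCase", "/cgi-bin/", "/cgi-bin/getQuote"))
    print(url_rule_matches("regexp", "exactCase",
                           "/cgi-bin/searchResults.*corporate.*",
                           "/cgi-bin/searchResults?dept=corporate&year=2004"))
```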
Performance notes
| Project Computing Pty Ltd ACN: 008 590 967 | contact@projectComputing.com |