Project Computing - Apache Filter FAQ

This FAQ is maintained by Kent Fitch from Project Computing. It contains the information I would have wanted to read when I started to write an Apache Filter.

A large part of learning about Apache Filters is about understanding the Apache Portable Runtime (APR), the life-cycle of an Apache request and Apache's in-memory data structures. This FAQ touches on those areas, pointing you to specific resources that deal with those aspects, and preferring to concentrate on the "Filter specific" issues.

Please send contributions, updates and comments to Kent.Fitch@projectComputing.com. All contributions will be fully attributed.

Last updated: 6 August 2003

Context
How do I get started writing a filter?
Common Problems/Issues
Community
1. What mailing lists should I monitor?

Context
1. What are Apache Filters?
  The Apache version 2 API was rewritten to make Apache much easier to extend. One of the major changes was the introduction of a filter API which allows you to write code which examines and possibly modifies the request data flowing into the web server from the client and the response data flowing back from the web server to the client.
  
  This data may flow through various filters which may transform it in various ways. For example, SSL is implemented in Apache 2 as a filter which deals with the encryption of the request and the response.
2. Why would I want to write a filter?
  Because you want to extend Apache, or change the way it does something, but in a manner which coexists with (that is, doesn't break) the rest of the Apache infrastructure (such as SSL).
  
  A very common reason is that you want access to POSTed data. Because there is no limit to the size of data sent from the client to the server, Apache cannot reasonably store this in memory, hooked from the Request data structure. But an input filter provides a simple and standard way to get to this data (and possibly even change it for downstream consumers of the request), even for SSL requests.
3. What's the difference between an Apache "module", a "filter" and a "handler"?
  "Module" is a general term for any code that gets linked with or loaded by Apache and uses the Apache API. "Handler" and "Filter" are 2 subdivisions of this term which describe different types of input and output processors. Simplistically:
  - A Handler generates the response sent back to the client. Each request will be claimed by and "handled" by a single handler.
  - Filters can inspect this response and optionally change the content in various ways: insert content (eg mod_include), encrypt it (mod_ssl), compress it (mod_deflate) or perhaps "chunk" the response differently.
    
    The content generated by the Handler is exposed to the output filters using the Apache Bucket API - the filters get to see it (and operate on it) on its journey through Apache back to the client.
  Handlers register with Apache during their initialisation, and may be specified in the Apache config using the AddHandler or SetHandler directives. Apache will invoke all registered handlers in turn until one "accepts" the responsibility of generating the response to the request.
  
  Similarly, filters register with Apache during their initialisation, and may be specified in the Apache config using the AddInput/OutputFilter or SetInput/OutputFilter directives. But whereas only one handler generates the response, any number of output filters can read or modify it using the Apache Bucket API. And any number of input filters can read and modify the request before the handler sees it (again using the Bucket API).
  
  Examples of handlers can be found in the "generators" and "mappers" subdirectory of the Apache "modules" source directory, and filters in the "filters" subdirectory.
  
  Of course, a single module could be both a filter and a handler - it just has to register with Apache when it wants to be invoked.
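  
  The division of labour described above (one handler generates, any number of output filters then transform) can be sketched in plain C. This is a standalone model only, not the Apache API: every model_* name is hypothetical, and a character buffer stands in for the bucket brigades that real filters receive.
```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an output filter: it transforms the buffer
 * in place. Real Apache filters are handed bucket brigades instead. */
typedef void (*model_filter_fn)(char *buf, size_t cap);

/* The "handler": the single content generator for the request. */
static void model_handler(char *buf, size_t cap) {
    snprintf(buf, cap, "hello");
}

/* Two "output filters" that each see the handler's content in turn. */
static void model_upcase_filter(char *buf, size_t cap) {
    (void)cap;
    for (; *buf; buf++)
        *buf = (char)toupper((unsigned char)*buf);
}

static void model_footer_filter(char *buf, size_t cap) {
    size_t used = strlen(buf);
    assert(used < cap);                     /* room for the terminator */
    snprintf(buf + used, cap - used, "!");  /* append to the content */
}

/* One handler generates; every filter in the chain then runs in order. */
static void model_run_request(char *buf, size_t cap,
                              const model_filter_fn *filters, size_t n) {
    model_handler(buf, cap);
    for (size_t i = 0; i < n; i++)
        filters[i](buf, cap);
}
```
  With the chain { model_upcase_filter, model_footer_filter }, the handler's "hello" reaches the "client" as "HELLO!". Reordering the chain changes only the order of the transformations, not who generated the content.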
4. Is filtering supported in Apache Version 1?
  The filtering architecture discussed here was introduced in Apache 2.
  
  However:
  - if you are running in a Perl environment, consider Apache::Filter
  - As Nick Kew pointed out on the Apache Modules list (30 Nov 2002) an output filter callback mechanism was added to Apache 1.3.24 (quite separate from the mod_ssl EAPI), and as Kent Fitch discussed here (2 Dec 2002), this mechanism has been cleverly used by Gerard Materna's mod_trace_output to provide a filtering capacity. However, how robust this approach is in a general environment (with competing modules, SSL etc) is unknown.
How do I get started writing a filter?
1. What should I read first?
  Here's a list of resources to get you started:
  - Ryan Bloom was an architect of the Apache Portable Runtime and Apache 2, so his series on Apache modules and filters is authoritative:
    - Apache Modules
    - Writing Filters for Apache 2.0
    - Writing Input Filters for Apache 2.0 (The link to the source code on this page is currently (Nov02) non-operational, but you can find the source referred to here)
    - Writing Output Filters for Apache 2.0
    - Writing Filters for Apache 2.0 - presentation
    - Apache Portable Run-Time: why? - presentation
  - The Apache Modeling Project Document - Bernhard Gröne, Andreas Knöpfel, Rudolf Kugel, Oliver Schmidt. A wonderful overview of Apache, part of the Apache Modeling Project. See section 3.3, "Extending Apache". (Thanks to sanxius@yahoo.it for the link.)
  - Bucket Brigades: Data Management for Apache 2.0 (pdf) by Cliff Woolley, Apache Runtime Project. (Archived by The Internet Archive - was originally at http://www.apache.org/~jwoolley/bucketbrigades/bucketbrigades.pdf)
  - Apache 2.0 filters (powerpoint presentation) by Greg Ames and Jeff Trawick at ApacheCon 2002
  - Connecting middleware to Apache 2.0 by Uche Ogbuji - "Apache 2.0 has provided many API improvements. Uche Ogbuji gives an example of an Apache 2.0 filter module, and illustrates the new API by example."
  - How filters work in Apache 2.0 - from the Apache documentation site (read this after you understand the basics of filters, or otherwise it won't make much sense). Focusses largely on the types of filters and when they are executed to process the request or response.
  - Apache 2 Tutorials by Threebit.
  - Apache HTTPd Developer Links maintained by Erik Abele.
  - mod_perl - Input and Output Filters. Although Perl oriented, this discussion of filtering in Apache is worth reading.
2. What's the life-cycle of an Apache request?
  See Request Processing in Apache 2.0 on the Apache documentation site, and for a description of when filters get executed, see How filters work in Apache 2.0.
3. What's the purpose of the Apache Portable Runtime (APR)?
  Prior to the APR, Apache code was littered with platform-determined conditionally compiled code, which made the code hard to read and maintain. In a nutshell, the APR delivers an almost totally uniform API regardless of the run time platform by abstracting away operating system differences.
  
  See An Introduction to APR 2.0 by Christian Gross.
4. What's the best way to find out about the Apache API and data structures?
  To get the best understanding, there is no substitute for reading the code and the examples of modules and filters which come with the Apache distribution.
  
  An extremely useful tool is Doxygen. Just download and install it then run make dox in the top level Apache distribution directory. It will generate very useful hyperlinked documentation about the APR and Apache structures.
  
  The output of running Doxygen on the APR is available here.
  
  The output of running Doxygen on the Apache source is available here.
5. What about the Bucket API? Do I need to understand that?
  Yes, but the good news is that it is well designed and fairly easy to understand. Ryan Bloom's filter articles (above) contain an excellent introduction to the Bucket API.
  
  Cliff Woolley's presentation (also above) is a more in-depth treatment.
Common Problems/Issues
1. How do I set up a filter which can both look at the request and the response?
  1. In your module AP_MODULE_DECLARE_DATA structure, define a "register hooks" callback. This will be invoked as part of your module's initialisation, and gives a chance for your module to register for any of the processing hooks Apache makes available.
  2. One of the hooks you should register for is the ap_hook_insert_filter hook, something like this:
```
static ap_filter_rec_t * globalMyInputFilter ;
static ap_filter_rec_t * globalMyOutputFilter ;

...

static void myModuleRegisterHooks(apr_pool_t *p) {

 ap_hook_insert_filter(myModuleInsertFilters, NULL, NULL, APR_HOOK_MIDDLE) ;

 globalMyInputFilter = ap_register_input_filter(myInputFilterName, myInputFilter, 
		NULL, AP_FTYPE_RESOURCE) ;

 globalMyOutputFilter = ap_register_output_filter(myOutputFilterName, myOutputFilter,
		NULL, AP_FTYPE_RESOURCE) ;

}
```
    This will result in your function myModuleInsertFilters being invoked on each request. This code also registers the 2 filters, one input and one output, and saves the resultant pointers to Apache filter record structures, which makes the next step performed at a "per request" level more efficient...
  3. Now, on each request, Apache will invoke your myModuleInsertFilters callback code. It has to decide whether it is interested in this request (maybe by looking at the request and configuration data), and if so, add the filters to the request.
    
    Each filter can be associated with a "context", and by setting the same context on both filters, you make it easy for them to share data, status etc.
    
    For example:
```
static void myModuleInsertFilters(request_rec *r) {

  MyPerRequestContext *ctx ;

  MyPerServerConfig *myServerConfig = ap_get_module_config(r->server->module_config, 
			&myFilter_module);

  if(!myServerConfig->enabled) return ;   // some server level enablement switch

  if (.....) {                            // some other "are we interested?" type tests...
    return ;                              // return without doing anything
  }
                                           
  ctx = apr_palloc(r->pool, sizeof(*ctx)) ;  // allocate my "per request" context from the 
                                             // Apache-managed "per request" memory pool
  ctx->status = 0 ;                          // initialise it...
  ....						

  // add the input and output filters, sharing a context

  ap_add_input_filter_handle(globalMyInputFilter, ctx, r, r->connection) ;
  ap_add_output_filter_handle(globalMyOutputFilter, ctx, r, r->connection) ;
}
```
  Note that because the module is adding itself as an input/output filter, the Apache configuration file directives to add filters should not be used (AddInputFilter, AddOutputFilter, SetInputFilter, SetOutputFilter).
2. Can different filters communicate?
  You can register as many input and output filters with the same context as you wish, making communication relatively easy. However, maybe you wish to communicate with completely separate filters. This is only likely to be "possible" or "useful" if the other filters are expecting such communication. However, Apache does provide an API for walking the chain of filters, which is anchored in the request structure:
```
ap_filter_t *  output_filters 
ap_filter_t *  input_filters 
```
  where the ap_filter_t structure contains a pointer to information about the filter (its name, entry-point, type), its context and a chain to the next filter.
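  
  Walking such a chain might look like the sketch below. The model_* types are simplified stand-ins I have made up for illustration, not Apache's declarations: they keep only the three things mentioned above (filter information, context, pointer to the next filter). The function looks a filter up by name so that its context can be inspected.
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins, NOT Apache's declarations. */
typedef struct model_filter_rec {
    const char *name;               /* registered filter name */
} model_filter_rec;

typedef struct model_filter {
    const model_filter_rec *frec;   /* information about the filter */
    void *ctx;                      /* the filter's context */
    struct model_filter *next;      /* next filter in the chain */
} model_filter;

/* Walk the chain from its head and return the first filter whose name
 * matches, or NULL if no such filter is present. */
static model_filter *model_find_filter(model_filter *head, const char *name) {
    assert(name != NULL);
    for (model_filter *f = head; f != NULL; f = f->next) {
        if (f->frec != NULL && f->frec->name != NULL &&
            strcmp(f->frec->name, name) == 0) {
            return f;
        }
    }
    return NULL;
}
```
  Finding the other filter is the easy part. Interpreting its context is only safe if both filters agree on what that context points to, which is why this is really only useful between cooperating filters.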
  
  The "notes" field in the request is an apr-table which can provide a handy mechanism for storing information associated with a request if you can't share a filter context, as described in this post from Estrade Matthieu.
3. How does the Apache process and thread architecture affect my filter?
  Your filter code should expect to run in any of the Apache Multi-Processing Modules (MPM) environments.
  
  Depending on how Apache has been configured, the process in which your filter is running may be one of many running the Apache server, and may have many threads running in the process. Your filter will be initialised every time a process is created, which may be quite often, again depending on Apache configuration and on server load.
  
  With the multi-threaded MPMs, you must be aware that several threads of execution could be concurrently running in your filter, and hence use interthread locking where appropriate (see Thread Safety, and the APR Thread Mutex and Atomic operations).
  
  But imagine you wish to share a connection pool or heavy-weight structures between instances of your filter across Apache processes. In this case you can use the APR's abstraction of shared memory, or "backing storage" such as disk files or external databases.
  
  It is a common misconception to think that filters running in separate processes can somehow share static variables! These can be (and are) "shared" however between threads, and sometimes access has to be appropriately controlled to prevent unexpected corruption.
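  
  The point about static variables and processes can be demonstrated without Apache at all. The sketch below is plain POSIX C (it assumes fork() is available, and the names are hypothetical): a child process increments a static counter, and the parent then checks its own copy.
```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int request_count = 0;   /* a static a filter might hope to "share" */

/* Fork a child that increments the static, then report what each side
 * sees. Returns 0 on success and fills in both values, or -1 on error. */
static int model_compare_counts(int *child_saw, int *parent_saw) {
    assert(child_saw != NULL && parent_saw != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        request_count++;          /* changes the child's copy only */
        _exit(request_count);     /* report the child's value as exit status */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    *child_saw = WEXITSTATUS(status);
    *parent_saw = request_count;  /* the parent's copy is untouched */
    return 0;
}
```
  The child reports 1 while the parent still sees 0: each process has its own copy of the variable. Anything that really must be visible across processes has to live in shared memory or some other external store, as described above.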
4. How do I access the request data POSTed by the client?
  Insert your code as an input filter and use the Bucket API to read any request data.
  
  A common question is "what field in the request structure points to the POSTed request data?". However, there is no such field. Because the length of the POSTed data is unbounded, Apache could not store it in a data structure without either limiting its length (and hence breaking applications that use HTTP in a standards-compliant way) or risking exhaustion of address space, whether through a denial-of-service attack or an innocent error.
  
  The best way to get access to POSTed data and "play nicely" in the Apache world is to use an input filter. It will also allow you to access SSL encrypted request data (assuming that your filter accesses the input bucket brigade after the SSL input filter, which it almost certainly will).
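  
  To show why no "POST data" field is needed, here is a standalone model of the way an input filter sees the request body: one brigade of buckets at a time, with running state kept in a per-request context. The model_* types are hypothetical stand-ins, not the APR bucket types. In a real filter you would obtain each brigade with ap_get_brigade(), read each bucket with apr_bucket_read() and detect the end of the body with APR_BUCKET_IS_EOS().
```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for buckets; NOT the APR types. */
typedef enum { MODEL_BUCKET_DATA, MODEL_BUCKET_EOS } model_bucket_type;

typedef struct model_bucket {
    model_bucket_type type;
    const char *data;            /* valid for DATA buckets only */
    size_t len;
    struct model_bucket *next;   /* next bucket in this brigade */
} model_bucket;

/* Per-request state kept between filter invocations. */
typedef struct model_post_ctx {
    size_t bytes_seen;   /* running total of body bytes */
    size_t ampersands;   /* example analysis: count form-field separators */
    int saw_eos;         /* set once the end-of-stream bucket arrives */
} model_post_ctx;

/* Examine one brigade's worth of request body. It is called once per
 * brigade, so the body is never held in memory as a whole. */
static void model_input_filter(model_post_ctx *ctx, const model_bucket *b) {
    assert(ctx != NULL);
    for (; b != NULL; b = b->next) {
        if (b->type == MODEL_BUCKET_EOS) {
            ctx->saw_eos = 1;
            break;
        }
        for (size_t i = 0; i < b->len; i++) {
            if (b->data[i] == '&')
                ctx->ampersands++;
        }
        ctx->bytes_seen += b->len;
    }
}
```
  Feeding it "a=1&b" and then "=2&c=3" followed by an end-of-stream bucket leaves the context with 11 bytes seen and 2 separators, even though the filter never held more than one chunk at a time. Note that a field can straddle two brigades, so any real parsing has to carry partial state in the context.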
5. How do I arrange for a handler to be invoked for specific requests?
  First up, handlers and filters are different beasts - handlers are primary content generators whereas filters can inspect and alter content (more on the differences here).
  
  The tricky concept with handlers is that Apache invokes all registered handlers one by one until one of them accepts the responsibility of being the primary content generator. Apache points to a handler name from the request structure. This handler name can be specified by AddHandler/SetHandler directives in the Apache configuration file (see the Apache handler documentation for details).
  So, the first thing a handler should do is check whether it is the "nominated" handler for this particular request, and return DECLINED if not, as soon as possible.
  
  For example, given this configuration:
```
<Location /testLocation>
    SetHandler my-test-handler
</Location>
```
  then when processing a request for, say, "/testLocation/test" Apache will set the request handler field to point to the string "my-test-handler". It will then invoke each handler registered with it to see which wants to "claim" the right to generate the response.
  
  So, imagine a module which registers from its "register hooks" entry-point like this:
```
static void my_module_register_hooks(apr_pool_t *p) {
    ap_hook_handler(my_module_handler, NULL, NULL, APR_HOOK_MIDDLE) ;
}
```
  Apache will invoke "my_module_handler" for each and every request, unless some other handler of a higher priority (see that last parameter on the "ap_hook_handler" invocation!) is invoked first and claims the request.
  
  So, in the "my_module_handler" you'll want to DECLINE the request (that is, leave it to some other handler) unless the request handler field specifies the "magic string" (in this case, "my-test-handler") which identifies you as the handler:
```
static int my_module_handler(request_rec *r) {
 
    if (!r->handler || strcmp(r->handler, "my-test-handler"))
        return DECLINED ;

    // I've been nominated to handle this request!
    ...

    return OK ;	// means I've handled the request - no other handlers need be asked
}
```
  Is it necessary to test that the handler pointer isn't null? I'm not sure, but better safe than sorry!
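  
  The "ask each handler in turn" behaviour described above can be modelled in a few lines of standalone C. This is a sketch of the idea, not Apache's hook implementation: the model_* names are hypothetical, and the return codes simply mirror the values of Apache's OK and DECLINED.
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in return codes for this standalone model. */
enum { MODEL_OK = 0, MODEL_DECLINED = -1 };

typedef struct model_request {
    const char *handler;      /* name chosen by SetHandler/AddHandler, or NULL */
    const char *served_by;    /* filled in by whichever handler accepts */
} model_request;

typedef int (*model_handler_fn)(model_request *r);

/* Claims only requests nominated for "my-test-handler". */
static int model_test_handler(model_request *r) {
    if (!r->handler || strcmp(r->handler, "my-test-handler") != 0)
        return MODEL_DECLINED;
    r->served_by = "my-test-handler";
    return MODEL_OK;
}

/* Catch-all placed last in the list: accepts anything. */
static int model_default_handler(model_request *r) {
    r->served_by = "default";
    return MODEL_OK;
}

/* Invoke handlers in order; stop at the first that does not decline. */
static int model_run_handlers(const model_handler_fn *hooks, size_t n,
                              model_request *r) {
    assert(r != NULL);
    for (size_t i = 0; i < n; i++) {
        int rc = hooks[i](r);
        if (rc != MODEL_DECLINED)
            return rc;
    }
    return MODEL_DECLINED;    /* nobody claimed the request */
}
```
  In this model a request with no handler name at all falls through to the catch-all. Without the NULL test, model_test_handler would pass a null pointer to strcmp, which is undefined behaviour in C.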
Community
1. What mailing lists should I monitor?
  - Apache module developers mailing list
    This list carries lots of discussion about Apache internals and questions and tips about writing modules and filters. Unfortunately, a complete archive does not exist, but postings can be found here (courtesy of Cory Wright) and here (courtesy of MARC/10East).
  - Apache server development mailing list
    This list is really only for Apache developers, but it is worth "listening in", perhaps in digest mode or by viewing the archives (Yahoo Groups) to get a good understanding of the rationale and issues behind the Apache 2 architecture and where it is headed.