Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.
Notice: This material is excerpted from Running A Perfect Web Site with Apache, ISBN: 0-7897-0745-4. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 05 - Apache Configuration

By this point you should have a running, minimal web server. In this chapter, you learn about most of the functionality that comes bundled with the server. This chapter is organized as a series of tutorials, so that new users can get up to speed. Toward the end of the chapter, you dive into some experimental Apache modules as well.

By the time you read this chapter, given the rapid pace of development, some significant new functionality will have been implemented and released. However, the existing functionality is not likely to change much. The Apache Group has had a strong ethic toward backward compatibility.

In this chapter you will learn how to:

Configure the MIME types of objects on the server
Use those MIME types to trigger special actions
Redirect and alias requests for different parts of your site
Configure directory indexing
Set up and use server-side includes
Set up internal imagemap handling
Use "cookies" to track user sessions
Set up configurable logging
Turn on and use content negotiation
Configure access control based on hostnames and IP numbers, or passwords
Configure "virtual" hosts
Customize the server's error messages

In short, this should cover most of the major functionality of Apache 1.0.

Configuration Basics

The "srm.conf" file (also known as the "ResourceConfig" file, after the httpd.conf directive that sets its location) and the "access.conf" file (also known as the "AccessConfig" file, likewise set by a directive in httpd.conf) are where most of the configuration related to the actual objects on the server takes place. The names are mostly historical: at one point, when the server was still NCSA, the only thing "access.conf" was good for was setting permissions, restrictions, authentication, and so forth. Then, when directory indexing was added, the cry went out for the capability to control certain characteristics on a directory-by-directory basis. The "access.conf" file was the only one that had any kind of structure for that: the pseudo-HTML "<Directory>" container.

With Apache's revamped configuration file parsing routines, most directives can literally appear anywhere: for example, within "<Directory>" containers in access.conf, within "<VirtualHost>" containers in httpd.conf, and so on. However, for sanity's sake, you should keep some structure to the configuration files. You should put server-processing-level configuration options in httpd.conf (like "Port," "<VirtualHost>" containers, etc.), put generic server resource information in "srm.conf" (like "Redirect," "AddType," directory indexing information, etc.), and per-directory configurations in "access.conf."

In addition to the "<Directory>" container, there is the "<Limit>" container, which is used within "<Directory>" containers to specify certain HTTP methods to which particular directives apply. Examples will be given later in this chapter.

Per-Directory Configuration Files

Before you get too deep into the long list of features, take a look at a mechanism that controls most of those features on a directory-by-directory basis by using a file in that directory itself. You can already control subdirectory options in access.conf, as outlined in the previous chapter. However, for a number of reasons, you may want to allow these configurations to be maintained by people other than those who have the power to restart the server (such as people maintaining their home pages), and for that purpose the "AccessFileName" directive was invented.

The default "AccessFileName" is ".htaccess." If you want to use something else, for example ".acc," you would say the following in the srm.conf file:

     AccessFileName .acc

If looking for this file is enabled, and a request comes in that translates to the file /www/htdocs/path/path2/file, the server will look for /.acc, /www/.acc, /www/htdocs/.acc, /www/htdocs/path/.acc, and /www/htdocs/path/path2/.acc, in that order. It will parse each file it finds to see what configuration options apply. Remember that this lookup and parsing has to happen separately for every hit, so it can carry a significant performance cost. You can turn it off by setting the following in your access config file:

     <Directory />
     AllowOverride None
     </Directory>
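The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the order in which candidate files are considered, not Apache's own code; the function name access_file_candidates is made up for this example.

```python
from pathlib import PurePosixPath

def access_file_candidates(file_path, access_file_name=".htaccess"):
    """List the per-directory config files considered for file_path, root first."""
    directory = PurePosixPath(file_path).parent
    # .parents runs from the nearest ancestor up to the root, so reverse it
    # and then add the file's own directory at the end.
    chain = list(reversed(directory.parents)) + [directory]
    return [str(d / access_file_name) for d in chain]

for candidate in access_file_candidates("/www/htdocs/path/path2/file", ".acc"):
    print(candidate)
```

Five directories means five lookups per request, which is why the per-hit cost mentioned above adds up on deep directory trees.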

For the sake of brevity and clarity, let's call these files ".htaccess" files. What options can these files affect? The range of available options is controlled by the "AllowOverride" directive within the <Directory> container in the AccessConfig file, as mentioned previously. The exact arguments to "AllowOverride" are as follows:

Argument Result
AuthConfig When listed, ".htaccess" files can specify their own authentication directives, such as "AuthUserFile," "AuthName," "AuthType," "require," and so on.
FileInfo When listed, ".htaccess" files can override any settings for metainformation about files, using directives such as "AddType," "AddEncoding," "AddLanguage," and so forth.
Indexes When listed, ".htaccess" files can locally set directives that control the rendering of the directory indexing, as implemented in the module "mod_dir.c." For example, "FancyIndexing," "AddIcon," "AddDescription," and the like.
Limit Allows the use of the directives that limit access based on hostname or host IP number (allow, deny, and order).
Options Allows the use of the "Options" directive.
All Allows all of the above.

"AllowOverride" options are not merged, which means that if the configuration for "/path/" is different from the configuration for "/", the "/path/" one will take precedence because it's "deeper."

MIME Types: AddType and AddEncoding

A fundamental element of the HTTP protocol, and the reason why the Web was so natural as a home for multiple media formats, is that every data object transferred through HTTP has an associated MIME type. What does this mean?

MIME stands for Multipurpose Internet Mail Extensions, and its origins lie in an effort to standardize the transmission of documents of multiple media through e-mail. Part of the MIME specification was that e-mail messages could contain meta-information in the headers: information about the information being sent. One type of MIME header is "Content-Type," which states the format or data type the object is in. For example, HTML is given the label "text/html," JPEG images are given the label "image/jpeg," etc. There is a registry of MIME types maintained by the Internet Assigned Numbers Authority, at http://www.isi.edu/div7/iana/.

When a browser asks a server for an object, the server gives that object to the browser and states what its "Content-Type" is, so the browser can make an intelligent decision about how to render the document. It can send it to an image program, to a PostScript viewer, to a VRML viewer, etc.

What this means to the server maintainer is that every object being served out must have the right MIME type associated with it. Fortunately, there has been a convention of expressing data type through two-, three-, or four-letter suffixes on file names; for example, "foobar.gif" is most likely to be a GIF image.

What the server needs is a file to map the suffix to the MIME content type. Fortunately, Apache comes with such a file in its config directory, a file called mime.types. You'll see that the format of this file is simple. The format consists of one record per line, where a record is a MIME type and a list of acceptable suffixes. This is because while more than one suffix may map to a particular MIME type, you can't have more than one MIME type per suffix. You can use the "TypesConfig" directive to specify an alternative location for the file.
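To make the record format concrete, here is a minimal Python sketch of reading such a file into a suffix-to-type table. The sample text is a hypothetical excerpt rather than the distributed file, and the handling of "#" comment lines is an assumption of this sketch.

```python
def parse_mime_types(text):
    """Map each suffix to its MIME type from mime.types-style text."""
    suffix_to_type = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumed: "#" starts a comment
        if not line:
            continue
        # One record per line: the MIME type, then its suffixes.
        mime_type, *suffixes = line.split()
        for suffix in suffixes:
            suffix_to_type[suffix] = mime_type  # one type per suffix
    return suffix_to_type

SAMPLE = """\
# hypothetical excerpt for illustration
text/html   html htm
image/jpeg  jpeg jpg jpe
image/gif   gif
"""

table = parse_mime_types(SAMPLE)
print(table["jpg"])
```

Because the table is keyed by suffix, several suffixes can share a type while each suffix still resolves to exactly one type.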

The Internet is evolving so quickly that it would be hard to keep that file completely up to date. To overcome that, you can use a special directive called "AddType," which can be put in an "srm.conf" file like the following:

AddType x-world/x-vrml wrl

Now, whenever the server is asked to serve a file that ends with ".wrl," it knows to also send a header like the following:

Content-Type: x-world/x-vrml

Thus, you don't have to worry about reconciling future distributions of the "mime.types" file with your private installations and configuration.

As you'll see in future pages, however, "AddType" is also used to specify "special" files that get magically handled by certain features within the server.

A sister to "AddType" is "AddEncoding." Just as the MIME header "Content-Type" can specify the data format of the object, the header "Content-Encoding" specifies the encoding of the object. An encoding is an attribute of the object as it is being transferred or stored; semantically, the browser should know that it has to "decode" whatever it gets based upon the listed encoding. The most common use is with compressed files. For example, if you have

AddEncoding x-gzip gz

and if you then access a file called "myworld.wrl.gz," the MIME headers sent in response will look like the following:

Content-Type: x-world/x-vrml
Content-Encoding: x-gzip

And any browser worth its two cents will know "Oh, I have to uncompress the file before handing it off to the VRML viewer."
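The way a type and an encoding are both derived from one file name can be illustrated with a short Python sketch. It is a simplified model, not the server's actual algorithm; the two tables hold only the example mappings from this section, and the function name describe is invented here.

```python
# Example mappings only: what AddType and AddEncoding would contribute.
TYPES = {"wrl": "x-world/x-vrml", "html": "text/html"}
ENCODINGS = {"gz": "x-gzip", "Z": "x-compress"}

def describe(filename):
    """Return (content_type, content_encoding) for a file name."""
    content_type = content_encoding = None
    # Look at every suffix after the base name, left to right.
    for suffix in filename.split(".")[1:]:
        if suffix in TYPES:
            content_type = TYPES[suffix]
        elif suffix in ENCODINGS:
            content_encoding = ENCODINGS[suffix]
    return content_type, content_encoding

print(describe("myworld.wrl.gz"))
```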

Alias, ScriptAlias, and Redirect

These three directives, all denizens of srm.conf, and all three implemented by the module "mod_alias.c," allow you to have some flexibility with the mapping between "URL-Space" on your server and the actual layout of your file system.

If that last statement sounded cryptic, don't worry. What it basically means is that any URL that looks like "http://myhost.com/x/y/z" does not have to necessarily map to a file named "x/y/z" under the document root of the server:

Alias /path/ /some/other/path/

The preceding directive will take a request for an object from the mythical subdirectory "/path" under the document root and map it to another directory somewhere else entirely. For example, a request for

http://myhost.com/statistics/

might normally go to "document root"/statistics, except that for whatever reason you wanted it to point somewhere else outside of the document root, say /usr/local/statistics. For that you'd have the following:

Alias /statistics/ /usr/local/statistics/

To the outside user this would be completely transparent. If you use Alias, it's wise not to alias to somewhere else inside of document root. Furthermore, a request like

http://myhost.com/statistics/graph.gif

would get translated into a request for the file

/usr/local/statistics/graph.gif
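The translation can be pictured as a prefix substitution. The following Python sketch is an assumption-laden illustration rather than mod_alias itself: the document root "/www/htdocs" and the function name translate are invented for the example.

```python
def translate(url_path, aliases, document_root):
    """Map a URL path to a file system path, honoring Alias-style prefixes."""
    for prefix, target in aliases:
        if url_path.startswith(prefix):
            # Replace the matched URL prefix with the aliased directory.
            return target + url_path[len(prefix):]
    # No alias matched: fall back to the document root.
    return document_root + url_path

ALIASES = [("/statistics/", "/usr/local/statistics/")]

print(translate("/statistics/graph.gif", ALIASES, "/www/htdocs"))
print(translate("/index.html", ALIASES, "/www/htdocs"))
```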

"ScriptAlias" is just like "Alias," with the side-effect of making everything in the subdirectory by default a CGI script. This might sound a bit bizarre, but the early model for building Web sites had all the CGI functionality separated into a directory by itself, and referenced through the Web server as shown in the following:

http://myhost.com/cgi-bin/script

If you have in your srm.conf

ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-bin/

then the preceding URL points to the script at "/usr/local/etc/httpd/cgi-bin/script." As you'll see in a page or two, there is another way to specify that a file is a CGI script to be executed.

"Redirect" does just that - it redirects the request to another resource. That resource could be on the same machine, or somewhere else on the Net. Also, the match will be a substring match, starting from the beginning. For example, if you did:

Redirect /newyork http://myhost.com/maps/states/newyork

then a request for

http://myhost.com/newyork/index.html

will get redirected to

http://myhost.com/maps/states/newyork/index.html

Of course, the second argument to Redirect can be a URL at some other site. Just make sure that you know what you're doing. Also, be wary of creating loops accidentally. For example,

Redirect /newyork http://myhost.com/newyork/newyork

can have particularly deleterious effects on the server!
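A small Python sketch shows both the prefix match and why the looping rule is harmful. It models only the rewriting of the URL, under the assumption that the client follows each redirect back to the same server; the function name redirect is hypothetical.

```python
def redirect(url_path, prefix, target):
    """Return the redirect location for url_path, or None if prefix does not match."""
    if url_path.startswith(prefix):
        # Whatever follows the matched prefix is carried over to the target.
        return target + url_path[len(prefix):]
    return None

HOST = "http://myhost.com"
print(redirect("/newyork/index.html", "/newyork", HOST + "/maps/states/newyork"))

# The looping rule: every new location still begins with /newyork,
# so it matches again and the URL grows on each hop.
path = "/newyork"
hops = []
for _ in range(3):
    location = redirect(path, "/newyork", HOST + "/newyork/newyork")
    hops.append(location)
    path = location[len(HOST):]
print(hops)
```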

A Better Way to Activate CGI Scripts

You read earlier that there is a more elegant way of activating CGI scripts than using "ScriptAlias." You can use the AddType directive and a "magic" MIME type, like so:

AddType application/x-httpd-cgi cgi

When the server gets a request for a ".cgi" file, it maps to that MIME type, and then catches itself and says "Aha! I need to execute this instead of just dishing it out like regular files." Thus, you can have CGI files in the same directories as your HTML and GIF and all your other files.

A later chapter will go into more detail about the implementation of CGI in Apache.

Directory Indexing

When Apache is given a URL to a directory, instead of to a particular file, for example

http://myhost.com/statistics/

Apache first looks for a file specified by the "DirectoryIndex" directive in "srm.conf." In the default configs, this is "index.html." You can set a list of files to search for, or even an absolute path to a page or CGI script:

DirectoryIndex index.cgi index.html /cgi-bin/go-away

The preceding directive says to look for an "index.cgi" in the directory first. If that can't be found, then look for an "index.html" in the directory. If neither can be found, then redirect the request to "/cgi-bin/go-away."
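The search order can be sketched in Python as follows. This is a simplification: the directory contents are passed in as a set of names instead of being read from disk, an absolute entry is treated as a fallback that always applies, and the function name resolve_index is invented for the example.

```python
DIRECTORY_INDEX = ["index.cgi", "index.html", "/cgi-bin/go-away"]

def resolve_index(directory_files, directory_index=DIRECTORY_INDEX):
    """Pick the first DirectoryIndex entry that applies to a directory."""
    for candidate in directory_index:
        if candidate.startswith("/"):
            return candidate  # absolute path: assumed to act as a fallback
        if candidate in directory_files:
            return candidate
    return None  # nothing matched; a listing would be generated instead

print(resolve_index({"index.html", "logo.gif"}))
print(resolve_index({"logo.gif"}))
```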

If none of these produces a match, then Apache will create, completely on the fly, an HTML listing of all the files available in the directory:

[Figure: directory listing output]

There are quite a few ways to customize the output of the directory indexing functionality. First, you need to ask yourself if you care about seeing things like icons, last-modified times, etc., in the reports. If you do, then you want to turn on

FancyIndexing On

otherwise you'll just get a simple menu of the available files, which you may want for security or performance reasons.

With that turned on, you must ask whether you need to customize it further, and how. The default settings for the directory indexing functionality are already pretty elaborate.

The AddIcon, AddIconByEncoding, and AddIconByType directives customize the selection of icons next to files. AddIcon matches icons at the filename level by using the pattern

AddIcon iconfile filename [filename] [filename]...

Thus, for example,

AddIcon /icons/binary.gif .bin .exe

means that any file which ends in ".bin" or ".exe" should get the "binary.gif" icon attached. The file names can also be a wildcard expression, a complete file name, or even one of two "special" names: "^^DIRECTORY^^" for directories and "^^BLANKICON^^" for blank lines. So you can see lines like

     AddIcon /icons/dir.gif ^^DIRECTORY^^

     AddIcon /icons/old.gif *~

Finally, the "iconfile" can actually also be a string containing both the iconfile's name and the alternate text to put into the ALT attribute. So, the earlier examples should really be

     AddIcon (BIN,/icons/binary.gif) .bin .exe

     AddIcon (DIR,/icons/dir.gif) ^^DIRECTORY^^

The "AddIconByType" directive is actually a little bit more flexible and probably comes more highly recommended in terms of actual use. Instead of tying icons to file name patterns, it ties icons to the MIME type associated with the files. The syntax is very roughly the same:

     AddIconByType iconfile mime-type [mime-type]...

"Mime-type" can be either the exact MIME type matching what you have assigned a file, or it can be a pattern match. Thus, you see entries in the default configuration files like the following:

     AddIconByType (SND,/icons/sound2.gif) audio/*

This is a lot more robust than trying to match against file name suffixes.

"AddIconByEncoding" is used mostly to distinguish compressed files from the others. This makes sense only if used in conjunction with "AddEncoding" directives in your "srm.conf" file. The default "srm.conf" has these entries:

AddEncoding x-gzip gz

AddEncoding x-compress Z

AddIconByEncoding (CMP,/icons/compressed.gif) x-compress x-gzip

This will set the icon next to compressed files appropriately.

The "DefaultIcon" directive specifies the icon to use when none of the patterns match a given file when the directory index is generated.

     DefaultIcon /icons/unknown.gif

It is possible to add text to the top and the bottom of the directory index listing. This capability is very useful as it turns the directory indexing capabilities from just a UNIX-like interface into a real dynamic document interface. There are two directives to control this: "HeaderName" and "ReadmeName," which specify the file names for the content at the top and bottom of the listing, respectively. Thus, as shown in the default srm.conf file:

     HeaderName HEADER

     ReadmeName README

When the directory index is being built, Apache will look for "HEADER.html." If it finds it, it'll throw the content into the top of the directory index. If it fails to find that file, it'll look for just "HEADER," and if it finds that it will presume the file is plain text and do things like escape characters such as "<" to "&lt;", and then insert it into the top of the directory index. The same process happens for the file "README," except that the resulting text goes into the bottom of the generated directory index.

In many cases, be it for consistency or just plain old security reasons, you will want to have the directory indexing engine just ignore certain types of files, like Emacs backup files or files beginning with a ".". The "IndexIgnore" directive addresses this; the default setting is

IndexIgnore */.??* *~ *# */HEADER* */README* */RCS

This line might look cryptic, but it's basically a space-separated list of patterns. The first pattern matches any "." file whose name is at least three characters long. This is so that the link to the higher-up directory ("..") can still work. The second ("*~") and third ("*#") are common patterns for matching old Emacs backup files. The next ones are to avoid listing the same files used for "HeaderName" and "ReadmeName" as in the preceding. The last ("*/RCS") is given because many sites out there use "RCS," a software package for revision control maintenance, which stores its extra (rather sensitive) information in "RCS" directories.
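The effect of these patterns can be tried out with Python's fnmatch module. This is an approximation: the server has its own wildcard matcher, and matching against full paths under "/www/htdocs" is an assumption made for this sketch.

```python
from fnmatch import fnmatchcase

# The default pattern list quoted above, split on whitespace.
INDEX_IGNORE = "*/.??* *~ *# */HEADER* */README* */RCS".split()

def is_ignored(path):
    """Return True if any IndexIgnore-style pattern matches the path."""
    return any(fnmatchcase(path, pattern) for pattern in INDEX_IGNORE)

for path in ["/www/htdocs/.htaccess", "/www/htdocs/..",
             "/www/htdocs/notes.txt~", "/www/htdocs/index.html"]:
    print(path, is_ignored(path))
```

Note that ".." slips past the first pattern because only one character follows its leading dot, which is exactly why the parent-directory link survives.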

Finally you get to two really interesting directives for controlling the last set of options regarding directory indexing. The first is "AddDescription," which works similarly to "AddIcon."

AddDescription description filename [filename]...

For example:

     AddDescription "My cat" /private/cat.gif

As elsewhere, "file name" can actually be a pattern, so you can have

     AddDescription "An MPEG Movie Just For You!" *.mpg

Finally, you have the granddaddy of all options-setting directives, "IndexOptions." This is the smorgasbord of functionality control. The syntax is simple:

IndexOptions option [option]...

And the list of available options is as follows:

Option Explanation
FancyIndexing This is the same as the separate "FancyIndexing" directive. Sorry to confuse everyone, but backward compatibility demands bizarre things sometimes!
IconsAreLinks If this is set, the icon will be clickable as a link to the resource its entry points to. In other words, the icon becomes part of the hyperlink.
ScanHTMLTitles When given a listing for an HTML file, the server will open the HTML file and parse it to obtain the value of the <TITLE> field in the HTML document, if it exists. This can put a pretty heavy load on the server, since it's a lot of disk accessing and some amount of CPU to extract the title from the HTML, so it's not recommended unless you know you have the capacity.
SuppressDescription, SuppressLastModified, SuppressSize These will suppress their respective fields in the directory indexing output. Normally each of those (Description, Last Modified, Size) is a field in the output listings.

By default none of these are turned on. The options do not "merge," which means that when you are setting these on a per-directory basis by using either access.conf or .htaccess files, setting the options for a more specific directory requires resetting the complete options listing. For example, envision the following in your access configuration file:

     <Directory /pub/docs/>
     IndexOptions ScanHTMLTitles
     </Directory>

     <Directory /pub/docs/others/>
     IndexOptions IconsAreLinks
     </Directory>

Directory listings done in or below the second directory, "/pub/docs/others/", would not have "ScanHTMLTitles" set. Why? Well, the reasoning was that administrators would need to be able to disable, in a specific directory, an option they had set globally, and this was simpler than writing "NOT" logic into the options listings.
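The "no merging" rule amounts to saying that the most specific matching section wins outright. Here is a minimal Python sketch of that rule; it is a model of the behavior described above and not the server's configuration code, and the function name effective_options is invented here.

```python
SECTIONS = {
    "/pub/docs/": {"ScanHTMLTitles"},
    "/pub/docs/others/": {"IconsAreLinks"},
}

def effective_options(path, sections=SECTIONS):
    """Return the options of the deepest section containing path (no merging)."""
    best = None
    for directory in sections:
        if path.startswith(directory) and (best is None or len(directory) > len(best)):
            best = directory
    # The winner's option set replaces, rather than extends, its parents'.
    return sections[best] if best is not None else set()

print(effective_options("/pub/docs/guide/"))
print(effective_options("/pub/docs/others/misc/"))
```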

If you run into problems getting directory indexing to work, make sure that the settings you have for the "Options" directive in the access config files allow for directory indexing in that directory. Specifically, the "Options" directive must include "Indexes." Furthermore, if you are using .htaccess files to set things like "AddDescription" or "AddIcon," the "AllowOverride" directive must include "Indexes" in its list of options. This is covered in more depth later on in this chapter.

User Directories

Sites with many users sometimes prefer to be able to give their users access to managing their own parts of the Web tree in their own directories, using the URL semantics of

http://myhost.com/~user/

Where "~user" is actually an alias to a directory in the user's home directory. This is different from the "Alias" directive, which could only map a particular pseudo-directory into an actual directory. In this case, you want "~user" to map to something like "/home/user/public_html," and because the number of "users" can be very high, some sort of macro is useful here. That macro is the directive "UserDir."

With "UserDir" you specify the subdirectory within the users' home directory where they can put content, which is mapped to the "~user" URL. So in other words, the default

UserDir public_html

will cause a request for

http://myhost.com/~eric/index.html

to cause a lookup for the UNIX file

/home/eric/public_html/index.html

presuming that "/home/eric" is eric's home directory. The default of "public_html" is a historical artifact more than anything else. There's no reason why you can't make it "Web_stuff" or something like that.
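As a rough illustration of the mapping, the following Python sketch expands a "~user" URL path into a file system path. The table of home directories stands in for the system's user database, and the function name user_dir_path is hypothetical.

```python
def user_dir_path(url_path, home_dirs, user_dir="public_html"):
    """Translate /~user/rest into <home>/<user_dir>/rest, or None if not applicable."""
    if not url_path.startswith("/~"):
        return None
    # Split "eric/index.html" into the user name and the remainder.
    user, _, rest = url_path[2:].partition("/")
    home = home_dirs.get(user)
    if home is None:
        return None  # unknown user
    return f"{home}/{user_dir}/{rest}"

HOMES = {"eric": "/home/eric"}  # stand-in for the system's user database
print(user_dir_path("/~eric/index.html", HOMES))
```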

Apache 1.1's user-directory module will have even more functionality, but at press time the feature set has not been nailed down.

Special Modules

Most of the functionality that distinguishes Apache from the competition has been implemented as modules to the Apache API. This has been extremely useful in allowing functionality to evolve separately from the rest of the server, and for allowing for performance tuning. This section will cover that extra functionality in detail.

Server Side Includes

Server side includes are best described as a preprocessing language for HTML. The "processing" takes place on the server side, such that visitors to your site never need know that you use server side includes, and thus no special client software is required. The format of these includes looks something like the following:

<!--#directive attribute="value" -->

Sometimes a given "directive" can have more than one attribute at the same time. The funky syntax is due to the desire to hide this functionality within an SGML comment - that way your regular HTML validation tools will work without having to learn new tags or anything. The syntax is important; leaving off the final "--," for example, will result in errors.

#include

This directive is probably the most commonly used directive. It is used to insert another file into the HTML document. The allowed attributes for this directive are "virtual" and "file." The functionality of the "file" attribute is a subset of that provided by the "virtual" attribute, and it exists mostly for backward compatibility, so its use is not recommended.

The "virtual" attribute instructs the server to treat the value of the attribute as a request for a relative link - meaning that you can use "../" to locate objects above the directory, and that other transforms like Alias will apply.

For example:

          <!--#include virtual="quote.txt" -->

          <!--#include virtual="/toolbar/footer.html" -->

          <!--#include virtual="../footer.html" -->

#exec

This directive is used to run a script on the server side and insert its output into the SSI document being processed. There are two choices: executing a CGI script by using the "cgi" attribute, or executing a shell command by using the "cmd" attribute.

For example:

          <!--#exec cgi="counter.cgi" -->

would take the output of the CGI program "counter.cgi" and insert it into the document. Note that the CGI output still has to include the "text/html" content type header or an error will occur.

Likewise,

          <!--#exec cmd="ls -l" -->

would take the output of a call to "ls -l" in the document's directory and insert it. Like the "file" attribute to the #include directive, this is mostly for backward compatibility, because it is something of a security hole in an untrusted environment.

There are definitely security concerns with allowing users access to CGI functionality, and even greater concerns with "#exec cmd", such as cmd="cat /etc/passwd". If the site administrator wishes to allow people to use server-side includes, but not to use the #exec directive, then they can set "IncludesNOEXEC" as an option for the directory in the access configurations.

#echo

This directive has one attribute, "var," whose value is any CGI environment variable as well as a small list of other variables:

Variable Definition
DATE_GMT The current date in Greenwich Mean Time.
DATE_LOCAL The current date in the local time zone.
DOCUMENT_NAME The file system name of the SSI document, not including the directories that contain it.
DOCUMENT_URI The "/path/file" part of a URL of the format "http://host/path/file."
LAST_MODIFIED The date the SSI document was last modified.

Example:

     <!--#echo var="DATE_LOCAL" -->

This will insert something like "Wednesday, 06-Mar-96 10:44:54 GMT" into the document.

#fsize, #flastmod

These two directives print out the size and the last-modified date, respectively, of any object given by the URI listed in the "file" or "virtual" attribute, as in the #include directive. For example

     <!--#fsize file="index.html" -->

would return the size of the index.html file in that directory.

#config

You can modify the rendering of certain SSI directives by using this directive.

The "sizefmt" attribute controls the rendering of the "#fsize" directive with values of "bytes" or "abbrev." The exact number of bytes is printed when "bytes" is given, whereas an abbreviated version of the size (either in K for kilobytes or M for megabytes) is given when "abbrev" is set.

Thus, for example, a snippet of SSI HTML like

  <!--#config sizefmt="bytes" -->
  The index.html file is <!--#fsize virtual="index.html" --> bytes

would return "The index.html file is 4,522 bytes." Meanwhile, if

  <!--#config sizefmt="abbrev" -->

was used, "The index.html file is 4K bytes" would be returned. The default is "abbrev."

The "timefmt" attribute controls the rendering of the date in the "DATE_LOCAL," "DATE_GMT," and "LAST_MODIFIED" values for the #echo directive. It uses the same format as the "strftime" call (in fact, the server simply calls "strftime"). This format consists of variables that begin with "%." For example, "%H" is the hour of the day, in 24-hour format. The full list of variables is best found by consulting your system's "man" page (type man strftime), which explains how to construct a "strftime"-format date string.

An example, though, might be:

     <!--#config timefmt="%Y/%m/%d-%H:%M:%S" -->

and the resulting date string for Jan. 2nd, 1996, at 12:30 in the afternoon would thus be

     1996/01/02-12:30:00
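Since the format string follows "strftime" conventions, it can be tried out in any language that exposes that call. As a quick check, here is the same format string applied to the same moment with Python's datetime module:

```python
from datetime import datetime

# Jan. 2nd, 1996, at 12:30 in the afternoon
moment = datetime(1996, 1, 2, 12, 30, 0)
stamp = moment.strftime("%Y/%m/%d-%H:%M:%S")
print(stamp)
```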

Finally, the last attribute the "config" directive can take is "errmsg," which is simply the error to print out if there are any problems parsing the document. For example, the default is:

     <!--#config errmsg="An error occurred while processing this directive" -->

Internal Imagemap Capabilities

The default imagemap module supplied with Apache allows you to reference imagemaps without using or needing any CGI programs. This functionality is contained in the "mod_imap" module. First, you add to your srm.conf yet another magic AddType directive:

     AddType application/x-httpd-imap map

This now means that any file ending with ".map" will be recognized as an imagemap file. After restarting the server to pick up the change, one can make reference to a .map file directly.

Look at an example: the following document, index.html, has an imagemap on it, where the image is "usa.jpg" and the mapfile is "usa-map.map." The HTML to build that imagemap would look like:

     <A HREF="usa-map.map"><IMG SRC="usa.jpg" ISMAP></A>

Imagemaps are covered in more detail in a later chapter; the only important thing from a configuration standpoint is that the magic content type is activated.

Cookies

HTTP "cookies" are a method for maintaining statefulness in a stateless protocol. What does this mean? In HTTP, a session between a client and a server typically spans many separate actual TCP connections, thus making it difficult to tie together accesses into an application that requires state, such as a shopping cart application. Cookies are a solution to that problem. As implemented by Netscape in their browser and subsequently by many others, servers can assign clients a "cookie," meaning some sort of opaque string whose meaning is significant only to the server itself, and then the client can give that cookie back to the server on subsequent requests.

The module "mod_cookies" nicely handles the details of assigning unique cookies to every visitor, based on their hostname and a random number. This cookie can be accessed from the CGI environment as the HTTP_COOKIE environment variable, for the same reason that all HTTP headers are accessible to CGI applications. The CGI scripts can use this as a key in a session tracking database, or it can be logged and tallied up to get a good, if undercounted, estimate of the total number of users that visited a site, not just the number of hits or even number of unique domains.
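To illustrate how a CGI program might pick up the cookie for use as a session key, here is a minimal Python sketch. The cookie name "Apache" and the value shown are hypothetical examples, not a statement of what the module actually sends; the function name session_keys is also invented for this example.

```python
import os
from http.cookies import SimpleCookie

def session_keys(environ=os.environ):
    """Parse the HTTP_COOKIE variable into a dict of cookie names and values."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    return {name: morsel.value for name, morsel in cookie.items()}

# Hypothetical environment, as a CGI script might see it.
fake_environ = {"HTTP_COOKIE": "Apache=host.example.com12345"}
print(session_keys(fake_environ))
```

A script could then use the returned value as the lookup key in its own session database.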

Happily, there are no configuration issues here - simply compile with mod_cookies and away you go. Couldn't be easier.

Configurable Logging

For most folks, the default logfile format (also known as Common Logfile Format, or CLF) does not provide enough information when it comes to doing a serious analysis of the efficacy of your web site. It provides basic numbers in terms of raw hits, pages accessed, hosts accessing, timestamps, etc., but it fails to capture the "referring" URL, the browser being used, and any cookies being used. So, there are two ways to get more data for your logfiles: by using the NCSA-compatibility directives for logging certain bits of info to separate files, or by using Apache's own totally configurable logfile format.

NCSA Compatibility

For compatibility with the NCSA 1.4 Web server, two modules were added. These modules log the "User-Agent" and "Referer" headers from the HTTP request stream.

"User-Agent" is the header most browsers send to identify what software the client is running. Logging of this header can be activated by an "AgentLog" directive in the srm.conf file, or in a virtualhost-specific section. This directive takes one argument, the name of the file to which the user-agents are logged. For example:

     AgentLog logs/agent_log

To use this, though, you need to ensure that the "mod_log_agent" module has been compiled and linked to the server.

Similarly, the "Referer" header is sent by the browser to indicate the tail end of a link - in other words, when you are on a page with a URL of "A," and there is a link on that page with a URL of "B," and you follow that link, the request for page "B" includes a "Referer" header with the URL of "A." This is very useful for finding what sites out there link to your site, and what proportion of traffic they account for.

The logging of this header is activated by a "RefererLog" directive, which points to the file to which the referers get logged.

     RefererLog logs/referer_log

One other option the Referer logging module provides is RefererIgnore, a directive that allows you to ignore "Referer" headers that contain a given string. This is useful for weeding out the Referers from your own site, if all you are interested in is links to you from other sites. For example, if your site is "www.myhost.com," you might want to use the following:

     RefererIgnore www.myhost.com

Remember that logging of the "Referer" header requires compiling and linking in mod_log_referer.

Totally Configurable Logging

The previous modules were provided, like many Apache features, for backward compatibility. They have some problems, though. Because they don't contain any other information about the request they are logging from, it's nearly impossible to tell which "Referer" fields went to which specific objects on your site. Ideally all the information about a transaction with the server can be logged into one file, extending the "common logfile format" or replacing it altogether. Well, such a beast exists, in the "mod_log_config" module.

This module implements the "LogFormat" directive, which takes as its argument a string, with variables beginning with "%" to indicate different pieces of data from the request. The variables are:

Variable Definition
%h Remote host
%l Remote "identd" identification
%u Remote user, as determined by any user authentication that may take place. Note that if the user was not authenticated, and the status of the request is a 401, this field may be bogus.
%t The common logfile format for time.
%r First line of request
%s Status. For requests that got internally redirected, this is status of the original request; %>s will give the last.
%b Bytes sent.
%{Foobar}i The contents of Foobar: header line(s) in the request from the client to the server.
%{Foobar}o The contents of Foobar: header line(s) in the response from the server to the client.

So, for example, if you wanted to capture in your log just the remote hostname, the object they requested, and the timestamp, you would do the following:

     LogFormat "%h \"%r\" %t"

And that would log things that looked like

     host.outsider.com "GET / HTTP/1.0" [06/Mar/1996:10:15:17]

Note that you really have to put quotes around the request variable, because the configurable logging module does not escape the values of the variables. Inside the format string, write each of those quotes as a backslash-quote, "\"", to distinguish it from the end of the string.

Say you want to add logging of the "User-Agent" string to that as well - in this case, your log format would become:

     LogFormat "%h \"%r\" %t \"%{User-Agent}i\""

Because the User-Agent field typically has spaces in it, it too should be quoted. Say you want to capture the Referer field:

     LogFormat "%h \"%r\" %t %{Referer}i"

You don't need the escaping quotes because Referer headers, since they are URLs, don't have spaces in them. However, if you are building a mission-critical application you might as well quote it too, because the "Referer" header is supplied by the client and thus there are no guarantees about its format.

The default format is the Common Logfile Format (CLF), which in this syntax is expressed as

     LogFormat "%h %l %u %t \"%r\" %s %b"

In fact, most existing logfile analysis tools for CLF will ignore extra fields tacked onto the end, so to capture the most important extra information and yet still be parseable by those tools, you might want to use the format:

     LogFormat "%h %l %u %t \"%r\" %s %b %{Referer}i \"%{User-Agent}i\""
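If you write your own analysis scripts, a log in this extended format is straightforward to take apart. The following Python sketch is one way to do it; the regular expression, the field names, and the function name are choices made for this illustration, and it assumes that the logged values contain no embedded quotes (recall that the module does not escape them).

```python
import re

# CLF fields, optionally followed by an unquoted Referer and a quoted User-Agent.
EXTENDED_CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]*)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}|-) (?P<bytes>\d+|-)'
    r'(?: (?P<referer>\S+) "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = EXTENDED_CLF.match(line)
    return match.groupdict() if match else None
```

Because the two extra fields are optional in the pattern, the same function also accepts plain CLF lines; for those, the "referer" and "agent" entries are simply None.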

Power users take note: If you want even more control over what gets logged, you can use the configurable logging module to implement a simple conditional test for variables. This way, you can configure it to only log variables when a particular status code is returned, or not returned. The format for this is to insert a comma-separated list of those codes between the "%" and the letter of the variable, like so:

     %404,403{Referer}i

This means that the Referer header will only be logged if the status returned by the server is a "404 Not Found," or a "403 Forbidden." All other times just a "-" is logged. This would be useful if all you cared about using "Referer" for was to find old links that point to resources no longer available.

The negation of that conditional is to put a "!" at the beginning of the list of status codes; so for example,

     %!401u

will log the user in any user authentication transaction, unless the authentication failed, in which case you probably don't want to see the name of the bogus user anyway.
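The rule the module applies to each conditional variable can be written out in a few lines. This Python sketch only mimics the behavior described above; the function name and its parameters are invented for the illustration and are not part of Apache.

```python
def conditional_field(value, status, codes=(), negate=False):
    """Return the text that would be logged for one conditional variable.

    codes  -- the status codes listed between "%" and the variable letter
    negate -- True when the list is preceded by "!"
    """
    if not codes:
        return value or "-"           # unconditional variable
    matched = status in codes
    if negate:
        matched = not matched
    return value if (matched and value) else "-"
```

For instance, conditional_field(referer, status, codes=(404, 403)) corresponds to "%404,403{Referer}i", and conditional_field(user, status, codes=(401,), negate=True) corresponds to "%!401u".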

Remember that, like many functions, this can be configured per virtual host. Thus, if you want all logs from all virtual hosts on the same server to go to the same log, you might want to do something like

     LogFormat "hosta ...."

in the <virtualhost> sections for "hosta" and

     LogFormat "hostb ...."

in the <virtualhost> sections for "hostb." More details about virtual hosts will appear later in this chapter.

A key note: You have to compile in "mod_log_config" for this functionality. You must also make sure that the default logging module, "mod_log_common," is not compiled in, or the server will get confused.

Content Negotiation

Content negotiation is the mechanism by which a Web client can express to the server what data types it knows how to render, and based on that information, the server can give the client the "optimal" version of the resource requested. Content negotiation can happen on a number of different characteristics - the content type of the data (also called the "media type"), the human language the data is in (English? French? etc.), the character set of the document, and its encodings.

Content Type Negotiation

For example, say you want to use inlined JPEG images on your pages. You don't want to alienate people using older browsers, which don't know how to inline JPEG images, so you also make a gif version of that image. Even though the gif might be larger or only 8-bit, that's still better than giving the browser something it can't handle, causing a broken link. So, the browser and the server "negotiate" for which data format the server sends to the client.

The specifications for content negotiation have been a part of HTTP since the beginning. Unfortunately, it cannot be relied upon as extensively as one would like. For example, current browsers that implement "plug-ins" by and large do not express in the request headers which media types they have plug-ins for. Thus, content negotiation can't currently be used to decide whether to send someone a "ShockWave" file or its Java equivalent. The only safe place to use it currently is to distinguish between inlined JPEG or gif images on a page. Enough browsers in use today implement content negotiation closely enough to get this functionality.

The mod_negotiation.c in Apache 1.0 implements the content negotiation specifications in an older version of the HTTP/1.0 IETF draft, which at the time of this writing is on its way to informational RFC status. Content negotiation was removed from that specification because it was not entirely complete, and a document describing it could not be labeled "Best Current Practice," which is what the HTTP/1.0 specification became. Content negotiation is getting significantly enhanced for HTTP/1.1. However, this doesn't mean it can't be safely used now for inlined image selection.

To activate it, you must compile the module "mod_negotiation.c" into the server. There are actually two ways to configure content negotiation:

Using a "type-map" file describing all the variants of a negotiable resource with specific preference values and content characteristics.
Setting an "Options" value called "MultiViews".

Since the focus here is pragmatic, this section covers only the "MultiViews" functionality. If you are interested in the type-map functionality, the Apache Web site has documentation on it.

In your access.conf file, find the line that sets the "Options" for the part of the site within which you wish to enable content negotiation. This may be the whole site, and that's fine. If "MultiViews" is not present in that line, add it. The "All" value does not, ironically enough, include "MultiViews." This is again for backward compatibility. So, you might have a line that looks like:

     Options Indexes Includes MultiViews

or

     Options All MultiViews

Once this change is made, restart your server to pick up the new configuration.

With this turned on, you can do the following: place a JPEG image in a directory, say "/path/," and call it "image.jpg." Now, make an equivalent gif format image, and place it in the same directory, as "image.gif." The URLs for these two objects are

     http://host/path/image.jpg

and

     http://host/path/image.gif

respectively. Now, if you ask your Web browser to fetch

     http://host/path/image

the server will go into the "/path/" directory, see the two "image" files, and then determine which one to send based on what the client states it can support. In the case where the client says it can accept either JPEG images or gif images equally, the server will choose the version that is the smallest, and send that to the client. Usually, JPEG images are much smaller than gif images.
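The decision just described can be sketched in a few lines of Python. This is a deliberately simplified model, not the server's algorithm: it ignores the quality values a client may attach to each type, and the function name and the (filename, media type, size) tuples are made up for the illustration.

```python
from fnmatch import fnmatchcase


def choose_variant(variants, accept):
    """Pick the smallest variant whose media type the client accepts.

    variants -- list of (filename, media_type, size_in_bytes) tuples
    accept   -- media-type patterns from the Accept header, e.g. ["image/gif", "image/*"]
    """
    acceptable = [
        variant for variant in variants
        if any(fnmatchcase(variant[1], pattern) for pattern in accept)
    ]
    if not acceptable:
        return None                   # nothing the client says it can handle
    return min(acceptable, key=lambda variant: variant[2])[0]
```

With a 12,000-byte "image.jpg" and a 48,000-byte "image.gif" on hand, a client that accepts only "image/gif" gets the gif, while a client that accepts both gets the smaller JPEG.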

So, if you made your HTML look something like the following:

     <HTML><HEAD>

     <TITLE>Welcome to the Gizmo Home Page!</TITLE>

     </HEAD><BODY>

     <IMG SRC="/header" ALT="GIZMO Logo">

     Welcome to Gizmo!

     <IMG SRC="/products" ALT="Products">

     <IMG SRC="/services" ALT="Services">

     </BODY></HTML>

then you can have separate gif and JPEG files for "header," "products," and "services," and the clients will for the most part get what they claim they can support.

Note that if you have a file called "image" and a file called "image.gif," the file called "image" will be returned whenever a request is made for just "image." Likewise, a request specifically for "image.gif" would never return "image.jpg" even if the client knew how to render JPEG images.

Human Language Negotiation

If "MultiViews" is enabled, you can also distinguish resources by the language they are in, such as French, English, Japanese, etc. This is done by adding more entries to the file suffix namespace that map to the languages the server wishes to use, and then giving them a ranking so that ties can be broken. Specifically, two new directives go in the "srm.conf" file: "AddLanguage" and "LanguagePriority." The formats are as follows:

     AddLanguage en .en

     AddLanguage it .it

     AddLanguage fr .fr

     AddLanguage ja .ja

     LanguagePriority en fr ja it

Say you want to use this to negotiate on the file "index.html," which you have available in English, French, Italian, and Japanese. You would create an "index.html.en," "index.html.fr," "index.html.it," and "index.html.ja," respectively, and then reference the document as "index.html." When a multilingual client connects, it should indicate in one of the request headers ("Accept-Language," to be specific) which languages it prefers, and it expresses that in standard two-letter notation (the code for Japanese is "ja"). The server sees what the client can accept, and gives it "the best one." LanguagePriority is what organizes that decision of "the best one." If English is unacceptable to the client, the server tries French, then Japanese, then Italian. LanguagePriority also states which one should be served if there is no "Accept-Language" header.

Because the language mapping suffixes and the content-type suffixes share the same namespace, you can mix them around: "index.fr.html" is the same as "index.html.fr." Just make sure that you reference it with the correct negotiable resource.
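The role LanguagePriority plays as a tie-breaker can be shown with a short Python sketch. It is a simplified model rather than the server's actual algorithm: it treats every language the client lists as equally acceptable, and the function name and the three-language priority list used in the example are chosen only for the illustration.

```python
def choose_language(available, accept_language, priority):
    """Pick the language variant to serve.

    available       -- language codes for which a variant exists on disk
    accept_language -- codes from the Accept-Language header ([] if it is absent)
    priority        -- the LanguagePriority list, most preferred first
    """
    candidates = [
        lang for lang in available
        if not accept_language or lang in accept_language
    ]
    if not candidates:
        return None                   # no variant the client accepts

    def rank(lang):
        # Languages missing from the priority list sort after all listed ones.
        return priority.index(lang) if lang in priority else len(priority)

    return min(candidates, key=rank)
```

With variants in "en," "fr," and "it" and a priority of en, fr, it, a client sending "fr" and "it" is served French, and a client sending no Accept-Language header at all is served English.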

As-Is Files

Often, you might like to send specific HTTP headers with your documents, such as Expires:, but you don't want to make the page a CGI script. The easiest way is to use the "httpd/send-as-is" magic MIME type.

     AddType httpd/send-as-is asis

This means that any file that ends in ".asis" can include its own MIME headers. However, it must include two carriage returns before the actual body of the content. Actually, it should include two carriage return / line feed combinations, but Apache is forgiving and will insert that for you. So, if you wanted to send a document with a special unique custom MIME type you didn't want registered with the server, you can send:

     Content-type: text/foobar



     This is text in a very special "foobar" MIME type.
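If you generate such files from a script, it is easy to get the header/body separation right by construction. The Python sketch below assembles the bytes of an as-is document with proper carriage return / line feed pairs; the function name is hypothetical, and writing the result to a file ending in ".asis" is left to the caller.

```python
def build_asis(headers, body):
    """Return the bytes of an as-is document: header lines, a blank line, then the body.

    headers -- list of (name, value) pairs
    body    -- the document body as bytes
    """
    head = "".join("%s: %s\r\n" % (name, value) for name, value in headers)
    # The empty line (a bare CRLF) marks the end of the headers.
    return head.encode("ascii") + b"\r\n" + body
```

Calling build_asis([("Content-type", "text/foobar")], b"...") and saving the result as, say, "special.asis" reproduces the example above.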

The most significant application I've run across for this is as an extremely efficient mechanism for doing server-push objects without CGI scripts. A CGI script is usually needed to create a server-push because the Content-type includes the multipart separator (since a server-push is actually a MIME multipart message). For example:

     Content-type: multipart/x-mixed-replace;boundary=XXXXXXXX



     --XXXXXXXX

     Content-type: image/gif



     ....(gif data)....

     --XXXXXXXX

     Content-type: image/gif



     ....(gif data)....

     --XXXXXXXX

     ....

By making this stream of data a simple file instead of a CGI script, you save yourself potentially a lot of overhead. Just about the only thing you lose is the ability to do timed pushes, but for many people their slow internet connection acts as a sufficient time valve already.
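Such a file does not have to be assembled by hand. Here is a minimal Python sketch that builds one from a list of gif frames already read into memory; the function name is made up, the boundary string is the placeholder from the example above, and the frames are assumed not to contain the boundary themselves. The closing "--boundary--" line is the usual MIME way of ending a multipart message.

```python
def build_push(frames, boundary="XXXXXXXX"):
    """Return the bytes of a server-push as-is file holding the given gif frames."""
    parts = [
        ("Content-type: multipart/x-mixed-replace;boundary=%s\r\n\r\n" % boundary).encode("ascii")
    ]
    for frame in frames:
        parts.append(("--%s\r\n" % boundary).encode("ascii"))
        parts.append(b"Content-type: image/gif\r\n\r\n")
        parts.append(frame)           # raw image bytes, sent untouched
        parts.append(b"\r\n")
    parts.append(("--%s--\r\n" % boundary).encode("ascii"))
    return b"".join(parts)
```

Write the result to a file with the ".asis" suffix and the server sends it exactly as built.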

It should also be pointed out that if you have MultiViews turned on, you can add an ".asis" to the end of a file name and none of your links need to be renamed. I.e. "foobar.html" can easily become "foobar.html.asis," while still being able to call it "foobar.html."

One last compelling application of "asis" is being able to do HTTP redirection without needing access to server config files. For example, the following .asis file will redirect people to another location:

     Status: 302 Moved

     Location: http://some.other.place.com/path/

     Content-type: text/html



     <HTML>

     <HEAD><TITLE>We've Moved!</TITLE></HEAD>

     <BODY>

     <H1>We used to be here, but now we're

     <A HREF="http://some.other.place.com/path/">over there. </A>

     </H1>

     </BODY></HTML>

The HTML body is there simply for clients that don't understand the 302 response.

Advanced Functionality

Host-Based Access Control

One can control access to the server, or even a subdirectory of the server, based on the hostname, domain, or IP number of the client's machine. This is done by using the directives "allow" and "deny," which can be combined by using "order." Both "allow" and "deny" can take multiple hosts:

     deny from badguys.com otherbadguys.com

Typically, you want to do one of two things: you want to deny access to your server from everyone but a few other machines, or you want to grant access to everyone except a few hosts. The first case is handled as follows:

     order deny,allow

     deny from all

     allow from mydomain.com

This means, "only grant access to hosts in the domain 'mydomain.com'." This could include "host1.mydomain.com," "ppp.mydomain.com," "the-boss.mydomain.com," etc.

The "order" directive above tells the server to evaluate the "deny" conditions before the "allow" conditions when determining whether to grant access, so a matching "allow" overrides the blanket "deny." Likewise, the "only exclude a couple of sites" case described above can be handled by using:

     order allow,deny

     allow from all

     deny from badguys.com

"order" is needed because, again mostly for historical reasons, the order in which directives appear is not significant. Thus, the server needs to know which rule to apply first; whichever rule is applied last has the final say. The default for "order" is "deny,allow."

There is a third argument to "order," called "mutual-failure," in which a host has to pass both the "allow" and "deny" rules in order to succeed. In other words, it has to appear on the "allow" list, and it must not appear on the "deny" list. For example,

     order mutual-failure

     allow from mydomain.com

     deny from the-boss.mydomain.com

In this example, "the-boss.mydomain.com" is prevented from accessing this resource, but every other machine at "mydomain.com" can access it.
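The interplay of "order," "allow," and "deny" is easier to see when written out as a small function. The following is only a sketch in Python, not the server's own code: the function names are made up for illustration, hostnames are matched by domain suffix only, and it assumes that whichever list is evaluated second has the final say, with "deny,allow" granting access to hosts that match neither list and "allow,deny" refusing them.

```python
def matches(host, patterns):
    """True if host is one of the patterns, lies inside one of those domains, or "all" is listed."""
    host = host.lower()
    for pattern in patterns:
        pattern = pattern.lower()
        if pattern == "all" or host == pattern or host.endswith("." + pattern):
            return True
    return False


def access_granted(host, order, allow=(), deny=()):
    """Model how "order" combines the allow and deny lists for one client host."""
    allowed = matches(host, allow)
    denied = matches(host, deny)
    if order == "deny,allow":
        granted = True                # assumed default when neither list matches
        if denied:
            granted = False
        if allowed:
            granted = True            # allow is checked last, so it wins
    elif order == "allow,deny":
        granted = False               # assumed default when neither list matches
        if allowed:
            granted = True
        if denied:
            granted = False           # deny is checked last, so it wins
    elif order == "mutual-failure":
        granted = allowed and not denied
    else:
        raise ValueError("unknown order: %r" % order)
    return granted
```

Feeding it the mutual-failure example above, "the-boss.mydomain.com" is refused while "host1.mydomain.com" gets in.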

It should be mentioned at this point that protecting resources by hostname is dangerous. It is relatively easy for determined people who control the reverse-DNS mapping for their IP number to spoof any hostname they want. Thus, it is strongly recommended that you use IP numbers to protect anything sensitive. In the same way you can simply list the domain to refer to any machine in that domain, you can also give fragments of IP numbers:

     allow from 204.62.129

This will only allow hosts whose IP numbers match that, such as "204.62.129.1" or "204.62.129.130."

Typically these directives are used within a "<Limit>" container, and even that within a "<Directory>" container, usually in an access.conf configuration file. The following example is a good template for most protections; it protects the directory "/www/htdocs/private" from any host except those in the "204.62.129" IP space.

     <Directory /www/htdocs/private>

     Options Includes

     AllowOverride None

     <Limit GET POST>

     order deny,allow

     deny from all

     allow from 204.62.129

     </Limit>

     </Directory>

User Authentication

When you place a resource under "user authentication," you restrict access to it by requiring a name and password. These names and passwords are kept in a database on the server. This database can take many forms, and in fact Apache modules have been written to access flat file databases, DBM file databases, mSQL databases (a freeware database), Oracle and Sybase databases, and more. This book covers only the flat-file and DBM-format databases.

First, some basic configuration directives. The "AuthName" directive sets the authentication "Realm" for the password-protected pages. The "Realm" is what gets presented to clients when prompted for authentication: "Please enter your name and password for the realm," followed by the realm name.

The "AuthType" directive sets the "authentication type" for the area. In HTTP/1.0 there is only one authentication type, and that is "Basic." HTTP/1.1 will have a few more, such as "MD5."

The "AuthUserFile" directive specifies the file that contains a list of names and passwords, one pair per line, where the passwords are encrypted by using the simple Unix crypt() routines. For example:

     joe:D.W2yvlfjaJoo

     mark:21slfoUYGksIe

The "AuthGroupFile" directive specifies the file which contains a list of groups, and members of those groups, separated by spaces. For example:

     managers: joe mark

     production: mark shelley paul

Finally, the "require" directive specifies what conditions need to be met for access to be granted. It can list only a specified list of users who may connect, it can specify a group or list of groups of users who may connect, or it can say any valid user in the database is automatically granted access. For example:

     require user mark paul

     (Only mark and paul may access.)



     require group managers

     (Only people in group managers may access.)



     require valid-user

     (Anyone in the AuthUserFile database may access.)
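To see how the group file and the "require" line work together, consider this Python sketch. It models only the final authorization decision and assumes the password check has already succeeded; the function names are hypothetical, and the argument to satisfies() is the text that follows the word "require."

```python
def parse_groups(text):
    """Parse AuthGroupFile-style text ("group: member member ...") into a dict of sets."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue                  # skip blank or malformed lines
        name, members = line.split(":", 1)
        groups[name.strip()] = set(members.split())
    return groups


def satisfies(require, user, known_users, groups):
    """Decide whether an already-authenticated user meets one "require" condition."""
    words = require.split()
    kind, names = words[0], words[1:]
    if user not in known_users:
        return False                  # not in the user database at all
    if kind == "valid-user":
        return True
    if kind == "user":
        return user in names
    if kind == "group":
        return any(user in groups.get(group, set()) for group in names)
    raise ValueError("unknown require type: %r" % kind)
```

With the sample group file above, satisfies("group managers", "joe", ...) succeeds, while the same test for "paul" fails because paul is only in "production."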

The configuration file ends up looking something like this:

     <Directory /www/htdocs/protected/>

     AuthName Protected

     AuthType basic

     AuthUserFile /usr/local/etc/httpd/conf/users

     <Limit GET POST>

     require valid-user

     </Limit>

     </Directory>

If you want to protect it to a particular group, the configuration file looks something like the following:

     <Directory /www/htdocs/protected/>

     AuthName Protected

     AuthType basic

     AuthUserFile /usr/local/etc/httpd/conf/users

     AuthGroupFile /usr/local/etc/httpd/conf/group

     <Limit GET POST>

     require group managers

     </Limit>

     </Directory>

DBM Authentication

Apache can also be configured to use DBM files for faster password and group-membership lookups. To use this, you must have the "mod_auth_dbm" module compiled into the server.

"DBM" files are UNIX file types that implement a fast hashtable lookup, making them ideal for handling large user/password databases. The flat-file system requires parsing the password file for every access until a match is found, potentially going through the entire file before returning a "can't find that user" error. Hash tables, on the other hand, know instantly whether a "key" exists in the database, and what its value is.

Different systems implement them slightly differently - some use the "ndbm" libraries, some use the "berkeley db" libraries, but the interface through Apache is exactly the same.

To use a DBM file for the database instead of a regular flat file, you use a different directive, "AuthDBMUserFile" instead of "AuthUserFile." Likewise for the group file - "AuthDBMGroupFile" instead of "AuthGroupFile" is used.

Take a look at creating the DBM files. There is a program supplied in the "support" subdirectory of the Apache distribution called "dbmmanage." It is a tool for creating and managing DBM files. The basic syntax is as follows:

     dbmmanage dbmfile command key [value]

"Command" can be one of: add, adduser, view, delete

So, to add a "value" to a DBM file called "users," one would say:

     dbmmanage users add joe joespassword

You have just added a record to the DBM file, with "joe" as the key and "joespassword" as the value. To see this you say:

     dbmmanage users view joe

or if you want to see the whole database,

     dbmmanage users view

However, you want to store encrypted passwords, because that's what the server uses for authentication. For that you use the "adduser" command:

     dbmmanage users adduser joe joespassword

Now, if you do a "view" to look at it, "joespassword" will be replaced by a lot of what looks like junk. Don't worry, that's the encrypted password.

Groups are done a little bit differently in DBM files. Instead of making the key of the database the group, the key is the user and the value is a comma-separated list of the groups that user is in. For example:

     dbmmanage group add joe managers,production

Wait, you say, there's no file called "users." Why do I see a "users.pag" and "users.dir"?

Well, DBM files are pretty weird. They aren't like regular files; they can't be looked at. Some systems implement the hash table by keeping the index separate from the data, as in this example with the ".pag" and ".dir" files. On BSD systems, where Berkeley DB is implemented, DBM files are saved with a ".db" suffix. So, one should get used to the idea that the "name" of a DBM file is actually its filename without the suffix.
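You can watch the "name without the suffix" behavior with any DBM-style library. The Python sketch below uses the dbm.dumb module from Python's standard library purely as a stand-in; it is not the ndbm or Berkeley DB format that Apache reads, its files use different suffixes, and the function name and the stored value are made up for the illustration.

```python
import dbm.dumb
import os
import tempfile


def demo_dbm_files():
    """Create a small key/value database named "users" and report what lands on disk."""
    with tempfile.TemporaryDirectory() as directory:
        name = os.path.join(directory, "users")   # the database "name": no suffix
        db = dbm.dumb.open(name, "c")
        db[b"joe"] = b"not-a-real-hash"
        found = db.get(b"joe")                    # direct lookup by key
        missing = b"nobody" in db                 # key test without scanning a flat file
        db.close()
        return sorted(os.listdir(directory)), found, missing
```

The listing that comes back contains only files whose names start with "users." and never a file called plain "users," which is the same surprise described above.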

The configuration file snippet now looks something like:

     <Directory /www/htdocs/protected/>

     AuthName Protected

     AuthType basic

     AuthDBMUserFile /usr/local/etc/httpd/conf/users

     AuthDBMGroupFile /usr/local/etc/httpd/conf/group

     <Limit GET POST>

     require group managers

     </Limit>

     </Directory>

Note that "users" and "group" must be the "name" of the DBM file, as described above, and pointing to

     AuthDBMUserFile /usr/local/etc/httpd/conf/users.db

would not work.

Make sure that you don't put the user and group databases in the public Web tree, ever. Several Web search engines out there have proven themselves to be efficient sources for "/etc/passwd" files unintentionally put on the site. Don't take that risk.

Virtual Hosts

Apache implements a very clean way of handling "virtual hosts," which is the name for the mechanism for being able to serve more than one "host" on a particular machine. Due to a limitation in HTTP, this is accomplished currently by assigning more than one IP number to a machine, and then having Apache bind differently to those different IP numbers. For example, a Unix box might have 204.122.133.1, 204.122.133.2, and 204.122.133.3 pointing to it, with www.host1.com bound to the first, www.host2.com bound to the second, and www.host3.com bound to the third.

This book will not go into how to configure additional IP addresses for your machine, since that varies completely from platform to platform. Your user manual for the operating system should contain information about configuring additional numbers - this is a standard capability on just about all systems these days.

"Virtual hosts" are configured using a container in httpd.conf. They look something like this:

     <VirtualHost www.host1.com>

     DocumentRoot /www/htdocs/host1/

     TransferLog logs/access.host1

     ErrorLog logs/error.host1

     </VirtualHost>

The attribute in the VirtualHost tag is the hostname, which the server looks up to get an IP address. Note that if there is any chance that "www.host1.com" can return more than one number, or if the web server might have trouble resolving that to an IP number at any point, you might want to use the IP number instead.

Any directives put within the VirtualHost container pertain only to requests made to that hostname. The DocumentRoot points to a directory which (presumably) contains content specifically for www.host1.com.

Each virtual host can have its own access log, its own error log, its own derivative of the other logs out there, its own Redirect and Alias directives, its own ServerName and ServerAdmin directives, and more. In fact the only things it cannot support, out of the core set of directives, are:

ServerType, UserId, GroupId, StartServers, MaxSpareServers, MinSpareServers, MaxRequestsPerChild, BindAddress, PidFile, TypesConfig, and ServerRoot.

If you plan on running Apache with a large number of virtual hosts, you need to be careful to watch the process limits; for example, some Unix platforms only allow processes to open 64 file descriptors at once. An Apache child will consume one file descriptor per logfile per virtual host, so 32 virtual hosts each with their own transfer and error log would need 64 descriptors for the logs alone and would quickly cross that limit. You will notice if you are running into problems of this kind if your error logs start reporting errors like "unable to fork()", or your access logs aren't getting written to at all. Apache does try to call setrlimit() to handle this problem on its own, but the system sometimes prevents it from doing so successfully.
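The arithmetic is worth doing before you add virtual hosts. This small Python sketch spells it out; the function names are made up, and the "other" parameter is a hypothetical allowance for descriptors the process needs besides logfiles (standard input and output, the listening socket, and so on).

```python
def log_descriptors(virtual_hosts, logs_per_host=2, other=0):
    """Descriptors one server child needs: one per logfile per virtual host, plus extras."""
    return virtual_hosts * logs_per_host + other


def fits(virtual_hosts, limit=64, logs_per_host=2, other=0):
    """True if the estimated descriptor count stays within the per-process limit."""
    return log_descriptors(virtual_hosts, logs_per_host, other) <= limit
```

Thirty-two virtual hosts with a transfer log and an error log each already account for 64 descriptors, so any allowance at all for other files pushes the total past a 64-descriptor limit. On Unix systems, Python's resource.getrlimit(resource.RLIMIT_NOFILE) reports the limit actually in force.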

Customized Error Messages

Apache can give customized responses in the event of an error. This is controlled using the "ErrorDocument" directive. The syntax is:

     ErrorDocument <HTTP response code> <action>

Where "HTTP response code" is the event which triggers the "action." The "action" can be:

A local URI to which the server is internally redirected.
An external URL to which the client is redirected.
A text string, which starts with a '"', and where the %s variable contains any extra information if available.

For example:

     ErrorDocument 500 "Ack! We have a problem here: %s.

     ErrorDocument 500 /errors/500.cgi

     ErrorDocument 500 http://backup.myhost.com/

     ErrorDocument 401 /subscribe.html

     ErrorDocument 404 /debug/record-broken-links.cgi

Two extra CGI variables will be passed to any redirected resource: "REDIRECT_URL" will contain the original URL requested, and "REDIRECT_STATUS" will give the original status that caused the redirection. This will help the script if its job is to try and figure out what caused the error response.
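A script such as the "record-broken-links.cgi" named above might start by collecting those variables. Here is a minimal Python sketch of that step; the function name and the report format are invented for the illustration, and HTTP_REFERER is simply the Referer request header as it appears in the CGI environment.

```python
import os


def broken_link_report(environ=None):
    """Build a one-line report from the variables passed to a redirected error handler."""
    if environ is None:
        environ = os.environ          # a real CGI script reads its own environment
    url = environ.get("REDIRECT_URL", "(unknown)")
    status = environ.get("REDIRECT_STATUS", "(unknown)")
    referer = environ.get("HTTP_REFERER", "-")
    return "%s %s referred-by %s" % (status, url, referer)
```

Appending each report line to a file gives you a running list of stale links and the pages that still point at them.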

Assorted "httpd.conf" Settings

There are a few last configuration options that fell through the cracks.

BindAddress

At startup, Apache will bind to the port it is specified to bind to, for all IP numbers which the box has available. The "BindAddress" directive can be used to specify only a specific IP address to bind to. Using this, one can run multiple copies of Apache, each serving different virtual hosts, instead of having one daemon which can handle all virtual hosts. This is useful if you want to run two web servers with different system user IDs for security and access control reasons.

For example, let's say you have three IP addresses (1.1.1.1, 1.1.1.2, and 1.1.1.3, with 1.1.1.1 being the primary address for the machine), and you want to serve three web sites, yet you want one of them to run as a different user ID than the other two. You would have two sets of configuration files; one would say something like

     User web3

     BindAddress 1.1.1.3

     ServerName www.company3.com

     DocumentRoot /www/company3/

And the other would have

     User web1

     ServerName www.company1.com

     DocumentRoot /www/company1/

     <VirtualHost 1.1.1.2>

     ServerName www.company2.com

     DocumentRoot /www/company2/

     </VirtualHost>

If you launch the first, it will only bind to IP address "1.1.1.3". The second one, since it has no "BindAddress" directive, will bind to the port on all IP addresses. So, you want to launch a server with the first set of config files, then launch another copy of the server with the second set. There would essentially be two servers running.

PidFile

This is the location of the file containing the process-ID for Apache. This file is useful for being able to automate the shutdown or restart of the web server. By default, this is "logs/httpd.pid." For example, one could shut down the server by saying:

     cat /usr/local/etc/httpd/logs/httpd.pid | xargs kill -15

You might want to move this out of the "logs" directory and into something like "/var," but it's not necessary.
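The same shutdown can be scripted in other languages. As a rough Python equivalent of the one-line shell command above (the function names are made up for the illustration, and signal 15 is SIGTERM):

```python
import os
import signal


def read_pid(pidfile):
    """Return the process ID stored in the server's PidFile."""
    with open(pidfile) as handle:
        return int(handle.read().strip())


def stop_server(pidfile):
    """Send SIGTERM (signal 15) to the process named in the PidFile."""
    os.kill(read_pid(pidfile), signal.SIGTERM)
```

Calling stop_server("/usr/local/etc/httpd/logs/httpd.pid") does what the "cat ... | xargs kill -15" pipeline does, and fails with an error if the file is missing or does not hold a number.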

Timeout

This directive specifies the amount of time, in seconds, that the server will wait in between packets sent before considering the connection "lost." For example, "1200," the default, means that the server will wait for 20 minutes after sending a packet before it considers the connection dead if no response comes back. Busy servers may wish to turn this down, at the cost of reduced service to low-bandwidth customers.