Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Chapter 05 - Apache Configuration

By this point you should have a running, minimal web server. In this chapter, you learn about most of the functionality that comes bundled with the server. This chapter is organized as a series of tutorials, so that new users can get up to speed. Toward the end of the chapter, you dive into some experimental Apache modules as well. By the time you read this chapter, given the rapid pace of development, there will be some significantly new functionality implemented and released. However, the existing functionality is not likely to change much. The Apache Group has had a strong ethic toward backward compatibility. In this chapter you will learn how to:
In short, this should cover most of the major functionality of Apache 1.0.
Configuration Basics

The "srm.conf" (also known as the "ResourceConfig" file, which is a directive that can be set in httpd.conf) and "access.conf" (also known as the "AccessConfig" file, also a directive in httpd.conf) files are where most of the configuration related to the actual objects on the server takes place. The names are mostly historical - at one point, when the server was still NCSA, the only thing "access.conf" was good for was setting permissions, restrictions, authentication, and so forth. Then, when directory indexing was added, the cry went out for the capability to control certain characteristics on a directory-by-directory basis. The "access.conf" file was the only one that had any kind of structure for that: the pseudo-HTML "<Directory>" container.

With Apache's revamped configuration file parsing routines, most directives can literally appear anywhere: for example, within "<Directory>" containers in access.conf, within "<VirtualHost>" containers in httpd.conf, and so on. However, for sanity's sake, you should keep some structure to the configuration files. You should put server-processing-level configuration options in httpd.conf (like "Port," "<VirtualHost>" containers, etc.), put generic server resource information in "srm.conf" (like "Redirect," "AddType," directory indexing information, etc.), and per-directory configurations in "access.conf."

In addition to the "<Directory>" container, there is the "<Limit>" container, which is used within "<Directory>" containers to specify certain HTTP methods to which particular directives apply. Examples will be given later in this chapter.
Per-Directory Configuration Files

Before you get too deep into the long list of features, take a look at a mechanism that controls most of those features on a directory-by-directory basis by using a file in that directory itself. You can already control subdirectory options in access.conf, as outlined in the previous chapter. However, for a number of reasons, you may want to allow these configurations to be maintained by people other than those who have the power to restart the server (such as people maintaining their home pages), and for that purpose the "AccessFileName" directive was invented.

The default "AccessFileName" is ".htaccess." If you want to use something else, for example ".acc," you would say the following in the srm.conf file:
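As a minimal sketch of the case just described, the line in srm.conf might look like this. The ".acc" name is simply the example name used above, and the comment line is added here for illustration.

```apache
# Look for per-directory configuration in ".acc" instead of ".htaccess"
AccessFileName .acc
```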
    <Directory />
    AllowOverride None
    </Directory>

For the sake of brevity and clarity, let's call these files ".htaccess" files. What options can these files affect? The range of available options is controlled by the "AllowOverride" directive within the <Directory> container in the AccessConfig file, as mentioned previously. The exact arguments to "AllowOverride" are as follows:
MIME Types: AddType and AddEncoding

A fundamental element of the HTTP protocol, and the reason why the Web was so natural as a home for multiple media formats, is that every data object transferred through HTTP had an associated MIME type. What does this mean?
When a browser asks a server for an object, the server gives that object to the browser and states what its "Content-Type" is, so that the browser can make an intelligent decision about how to render the document. It can send it to an image program, to a PostScript viewer, to a VRML viewer, etc.

What this means to the server maintainer is that every object being served out must have the right MIME type associated with it. Fortunately, there has been a convention of expressing data type through two-, three-, or four-letter suffixes to the file name - e.g., "foobar.gif" is most likely to be a GIF image. What the server needs is a file to map the suffix to the MIME content type. Fortunately, Apache comes with such a file in its config directory, a file called mime.types. The format of this file is simple: one record per line, where a record is a MIME type and a list of acceptable suffixes. This is because while more than one suffix may map to a particular MIME type, you can't have more than one MIME type per suffix. You can use the "TypesConfig" directive to specify an alternative location for the file.

The Internet is evolving so quickly that it would be hard to keep that file completely up to date. To overcome that, you can use a special directive called "AddType," which can be put in an "srm.conf" file like the following:
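A sketch of what such a line might look like, assuming you want to serve a new data format from files ending in ".foo". Both the MIME type and the suffix here are made up for illustration; substitute the real type and suffix of the format you are adding.

```apache
# Map the (hypothetical) .foo suffix to a (hypothetical) MIME type
AddType application/x-foo foo
```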
As you'll see in future pages, however, "AddType" is also used to specify "special" files that get magically handled by certain features within the server. A sister to "AddType" is "AddEncoding." Just as the MIME header "Content-Type" can specify the data format of the object, the header "Content-Encoding" specifies the encoding of the object. An encoding is an attribute of the object as it is being transferred or stored; semantically, the browser should know that it has to "decode" whatever it gets based upon the listed encoding.

The most common use is with compressed files. For example, if you have
Alias, ScriptAlias, and Redirect

These three directives, all denizens of srm.conf, and all three implemented by the module "mod_alias.c," allow you to have some flexibility with the mapping between "URL-Space" on your server and the actual layout of your file system.

If that last statement sounded cryptic, don't worry. What it basically means is that any URL that looks like "http://myhost.com/x/y/z" does not have to necessarily map to a file named "x/y/z" under the document root of the server:
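To make the idea concrete, here is a hedged sketch of how the three directives might appear in srm.conf. The directory paths and the "www.otherhost.com" host are assumptions chosen for illustration, not values taken from any particular installation.

```apache
# Serve URLs under /icons/ from a directory outside the document root
Alias /icons/ /usr/local/etc/httpd/icons/

# Same mapping idea, but files under /cgi-bin/ are run as CGI scripts
ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-bin/

# Send requests for anything under /oldstuff to a different location
Redirect /oldstuff http://www.otherhost.com/newstuff
```

With lines like these, a request for "/icons/ball.gif" would be answered from the icons directory rather than from "icons/ball.gif" under the document root.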
"Redirect" does just that - it redirects the request to another resource. That resource could be on the same machine, or somewhere else on the Net. Also, the match will be a substring match, starting from the beginning. For example, if you did:
A Better Way to Activate CGI Scripts

You read earlier that there is a more elegant way of activating CGI scripts than using "ScriptAlias." You can use the AddType directive and a "magic" MIME type, like so:
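A minimal sketch of such a line follows. It assumes the magic type is "application/x-httpd-cgi" and that you want files ending in ".cgi" treated as scripts; check the srm.conf shipped with your copy of the server to confirm the exact type name.

```apache
# Treat any file ending in .cgi as a CGI script, wherever it lives
AddType application/x-httpd-cgi .cgi
```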
A later chapter will go into more detail about the implementation of CGI in Apache.
Directory Indexing

When Apache is given a URL to a directory, instead of to a particular file, for example
If it fails to find a match, then Apache will create, completely on the fly, an HTML listing of all the files available in the directory:
With that going on, you must ask whether you need to customize it further, and how. The default settings for the directory indexing functionality are already pretty elaborate. The AddIcon, AddIconByEncoding, and AddIconByType directives customize the selection of icons next to files. AddIcon matches icons at the filename level by using the pattern
"AddIconByEncoding" is used mostly to distinguish compressed files from the others. This makes sense only if used in conjunction with "AddEncoding" directives in your "srm.conf" file. The default "srm.conf" has these entries:
The "DefaultIcon" directive specifies the icon to use when none of the patterns match a given file when the directory index is generated.
In many cases, be it for consistency or just plain old security reasons, you will want to have the directory indexing engine just ignore certain types of files, like Emacs backup files or files beginning with a ".". The "IndexIgnore" directive addresses this; the default setting is
Finally you get to two really interesting directives for controlling the last set of options regarding directory indexing. The first is "AddDescription," which works similarly to "AddIcon."
By default none of these are turned on. The options do not "merge," which means that when you are setting these on a per-directory basis by using either access.conf or .htaccess files, setting the options for a more specific directory requires resetting the complete options listing. For example, envision the following in your access configuration file:
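As an illustration of this non-merging behavior, consider the sketch below. It assumes that the options under discussion are the arguments to the "IndexOptions" directive, and the directory paths are hypothetical.

```apache
<Directory /www/htdocs>
IndexOptions FancyIndexing ScanHTMLTitles
</Directory>

<Directory /www/htdocs/pub>
# This list replaces the parent's list rather than adding to it,
# so FancyIndexing has to be repeated if it is still wanted here.
IndexOptions FancyIndexing SuppressSize
</Directory>
```

In this sketch, "ScanHTMLTitles" is not in effect under "/www/htdocs/pub," because the more specific container lists only the two options shown.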
If you run into problems getting directory indexing to work, make sure that the settings you have for the "Options" directive in the access config files allow for directory indexing in that directory. Specifically, the "Options" directive must include "Indexes." Furthermore, if you are using .htaccess files to set things like "AddDescription" or "AddIcon," the "AllowOverride" directive must include in its list of options "FileInfo." This is covered in more depth later on in this chapter.
User Directories

Sites with many users sometimes prefer to be able to give their users access to managing their own parts of the Web tree in their own directories, using the URL semantics of
With "UserDir" you specify the subdirectory within the users' home directory where they can put content, which is mapped to the "~user" URL. So in other words, the default
Special Modules

Most of the functionality that distinguishes Apache from the competition has been implemented as modules to the Apache API. This has been extremely useful in allowing functionality to evolve separately from the rest of the server, and for allowing for performance tuning. This section will cover that extra functionality in detail.
Server Side Includes

Server side includes are best described as a preprocessing language for HTML. The "processing" takes place on the server side, so visitors to your site never need to know that you use server side includes, and no special client software is required. The format of these includes looks something like the following:
#include

This directive is probably the most commonly used directive. It is used to insert another file into the HTML document. The allowed attributes for this directive are "virtual" and "file." The functionality of the "file" attribute is a subset of that provided by the "virtual" attribute, and it exists mostly for backward compatibility, so its use is not recommended.

The "virtual" attribute instructs the server to treat the value of the attribute as a request for a relative link - meaning that you can use "../" to locate objects above the directory, and that other transforms like Alias will apply. For example:
#exec

This directive is used to run a script on the server side and insert its output into the SSI document being processed. There are two choices: executing a CGI script by using the "cgi" attribute, or executing a shell command by using the "cmd" attribute. For example:
Likewise,
#echo

This directive has one attribute, "var," whose value is any CGI environment variable as well as a small list of other variables:
Example:
#fsize, #flastmod

These two directives print out the size and the last-modified date, respectively, of any object given by the URI listed in the "file" or "virtual" attribute, as in the #include directive. For example
#config

You can modify the rendering of certain SSI directives by using this directive.

The "sizefmt" attribute controls the rendering of the "#fsize" directive with values of "bytes" or "abbrev." The exact number of bytes is printed when "bytes" is given, whereas an abbreviated version of the size (either in K for kilobytes or M for megabytes) is given when "abbrev" is set. Thus, for example, a snippet of SSI HTML like
The "timefmt" attribute controls the rendering of the date in the "DATE_LOCAL," "DATE_GMT," and "LAST_MODIFIED" values for the #echo directive. It uses the same format as the "strftime" call (in fact, the server simply calls "strftime"). This format consists of variables that begin with "%." For example, "%H" is the hour of the day, in 24-hour format. The full list of variables is best found in your system's "man" page; type man strftime for directions on how to construct a "strftime"-format date string. An example, though, might be:
Internal Imagemap Capabilities

The default imagemap module supplied with Apache allows you to reference imagemaps without using or needing any CGI programs. This functionality is contained in the "mod_imap" module. First, you add to your srm.conf yet another magic AddType directive:
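A sketch of that directive, assuming the magic type is "application/x-httpd-imap" and that your map files end in ".map" (verify the type name against the configuration files shipped with your server):

```apache
# Hand any file ending in .map to the internal imagemap module
AddType application/x-httpd-imap map
```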
Look at an example: the following document, index.html, has an imagemap on it, where the image is "usa.jpg" and the mapfile is "usa-map.map." The HTML to build that imagemap would look like:
Cookies

HTTP "cookies" are a method for maintaining statefulness in a stateless protocol. What does this mean? In HTTP, a session between a client and a server typically spans many separate actual TCP connections, thus making it difficult to tie together accesses into an application that requires state, such as a shopping cart application. Cookies are a solution to that problem. As implemented by Netscape in their browser and subsequently by many others, servers can assign clients a "cookie," meaning some sort of opaque string whose meaning is significant only to the server itself, and then the client can give that cookie back to the server on subsequent requests.

The module "mod_cookies" nicely handles the details of assigning unique cookies to every visitor, based on their hostname and a random number. This cookie can be accessed from the CGI environment as the HTTP_COOKIE environment variable, for the same reason that all HTTP headers are accessible to CGI applications. The CGI scripts can use this as a key in a session tracking database, or it can be logged and tallied up to get a good, if undercounted, estimate of the total number of users that visited a site, not just the number of hits or even number of unique domains. Happily, there are no configuration issues here - simply compile with mod_cookies and away you go. Couldn't be easier.
Configurable Logging

For most folks, the default logfile format (also known as Common Logfile Format, or CLF) does not provide enough information when it comes to doing a serious analysis of the efficacy of your web site. It provides basic numbers in terms of raw hits, pages accessed, hosts accessing, timestamps, etc., but it fails to capture the "referring" URL, the browser being used, and any cookies being used as well. So, there are two ways to get more data for your logfiles: by using the NCSA-compatibility directives for logging certain bits of info to separate files, or using Apache's own totally configurable logfile format.
NCSA Compatibility

For compatibility with the NCSA 1.4 Web server, two modules were added. These modules log the "User-Agent" and "Referer" headers from the HTTP request stream.

"User-Agent" is the header most browsers send that identifies what software the user is using. Logging of this header can be activated by an "AgentLog" directive in the srm.conf file, or in a virtualhost-specific section. This directive takes one argument, the name of the file to which the user-agents are logged. For example:
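A minimal sketch, assuming you want the log kept alongside the other logs under the server root; the file name "logs/agent_log" is an illustrative choice.

```apache
# Log the User-Agent header of each request to its own file
AgentLog logs/agent_log
```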
Similarly, the "Referer" header is sent by the browser to indicate the tail end of a link - in other words, when you are on a page with a URL of "A," and there is a link on that page with a URL of "B," and you follow that link, the request for page "B" includes a "Referer" header with the URL of "A." This is very useful for finding what sites out there link to your site, and what proportion of traffic they account for. The logging of this header is activated by a "RefererLog" directive, which points to the file to which the referers get logged.
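As with the agent log, a one-line sketch is enough to switch this on. The file name "logs/referer_log" is an assumption made for illustration.

```apache
# Log the Referer header of each request to its own file
RefererLog logs/referer_log
```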
Totally Configurable Logging

The previous modules were provided, like many Apache features, for backward compatibility. They have some problems, though. Because they don't contain any other information about the request they are logging from, it's nearly impossible to tell which "Referer" fields went to which specific objects on your site. Ideally all the information about a transaction with the server can be logged into one file, extending the "common logfile format" or replacing it altogether. Well, such a beast exists, in the "mod_log_config" module.

This module implements the "LogFormat" directive, which takes as its argument a string, with variables beginning with "%" to indicate different pieces of data from the request. The variables are:
So, for example, if you wanted to capture in your log just the remote hostname, the object they requested, and the timestamp, you would do the following:
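A sketch of such a format follows. It assumes that "%h" stands for the remote hostname, "%r" for the first line of the request, and "%t" for the timestamp; confirm these against the variable list for your version of mod_log_config.

```apache
# Remote host, the request line in quotes, then the time of the request
LogFormat "%h \"%r\" %t"
```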
Say you want to add logging of the "User-Agent" string to that as well - in this case, your log format would become:
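Extending the same sketch, and assuming that "%{Header}i" is the form for logging an arbitrary request header:

```apache
# As before, with the User-Agent request header added in quotes at the end
LogFormat "%h \"%r\" %t \"%{User-Agent}i\""
```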
The default format is the Common Logfile Format (CLF), which in this syntax is expressed as "%h %l %u %t \"%r\" %s %b".
The negation of that conditional is to put a "!" at the beginning of the list of status codes; so for example,
Remember that, like many functions, this can be configured per virtual host. Thus, if you want all logs from all virtual hosts on the same server to go to the same log, you might want to do something like
A key note: You have to compile in "mod_log_config" for this functionality. You must also make sure that the default logging module, "mod_log_common," is not compiled in, or the server will get confused.
Content NegotiationContent negotiation is the mechanism by which a Web client can express to the server what data types it knows how to render, and based on that information, the server can give the client the "optimal" version of the resource requested. Content negotiation can happen on a number of different characteristics - the content type of the data (also called the "media type"), the human language the data is in (English? French? etc.), the character set of the document, and its encodings.
Content Type Negotiation

For example, say you want to use inlined JPEG images on your pages. You don't want to alienate people using older browsers, which don't know how to inline JPEG images, so you also make a GIF version of that image. Even though the GIF might be larger or only 8-bit, that's still better than giving the browser something it can't handle, causing a broken link. So, the browser and the server "negotiate" for which data format the server sends to the client.

The specifications for content negotiation have been a part of HTTP since the beginning. Unfortunately, it cannot be relied upon as extensively as one would like. For example, current browsers that implement "plug-ins" by and large do not express in the connection headers which media types they have plug-ins for. Thus, content negotiation can't currently be used to decide whether to send someone a "ShockWave" file or its Java equivalent. The only safe place to use it currently is to distinguish between inlined JPEG or GIF images on a page. Enough browsers in use today implement content negotiation closely enough to get this functionality.

The mod_negotiation.c in Apache 1.0 implements the content negotiation specifications in an older version of the HTTP/1.0 IETF draft, which at the time of this writing is on its way to informational RFC status. Content negotiation was removed from later versions of that draft because its specification was not entirely complete, and a document describing it could not be labeled "Best Current Practice," which is what the HTTP/1.0 specification became. Content negotiation is getting significantly enhanced for HTTP/1.1. However, this doesn't mean it can't be safely used now for inlined image selection.

To activate it, you must include the module "mod_negotiation.c" into the server. There are actually two ways to configure content negotiation:
Since your focus is pragmatic, you will go only into the "MultiViews" functionality. If you are interested in the type-map functionality, the Apache Web site has documentation on it. In your access.conf file, find the line that sets the "Options" for the part of the site you wish to enable content negotiation within. This may be the whole site, but that's fine. If "MultiViews" is not present in that line, it must be added. The "All" value does not, ironically enough, include "MultiViews." This is again for backward compatibility. So, you might have a line that looks like:
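A sketch of what that might look like in access.conf, assuming a hypothetical document root of "/www/htdocs" and that directory indexing is the only other option you want enabled there:

```apache
<Directory /www/htdocs>
# MultiViews must be named explicitly; "All" does not include it
Options Indexes MultiViews
</Directory>
```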
With this turned on, you can do the following: place a JPEG image in a directory, say "/path/," and call it "image.jpg." Now, make an equivalent GIF format image, and place it in the same directory, as "image.gif." The URLs for these two objects are
So, if you made your HTML look something like the following:
Note that, if you have a file called "image" and a file called "image.gif," the file called "image" will be returned whenever a request is made for just "image." Likewise, a request specifically for "image.gif" would never return "image.jpg" even if the client knew how to render JPEG images.
Human Language Negotiation

If "MultiViews" is enabled, you can also distinguish resources by the language they are in, such as French, English, Japanese, etc. This is done by adding more entries to the file suffix namespace that map to the languages the server wishes to use, and then giving them a ranking so that ties can be broken. Specifically, two new directives go in the "srm.conf" file: "AddLanguage" and "LanguagePriority." The formats are as follows:
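A short sketch of the two directives in use. The choice of English and French, and the ".en" and ".fr" suffixes, are assumptions made for illustration.

```apache
# Map file suffixes to language tags
AddLanguage en .en
AddLanguage fr .fr

# When the client expresses no usable preference, prefer English over French
LanguagePriority en fr
```

With these lines and "MultiViews" enabled, a request for "index.html" could be satisfied by "index.html.en" or "index.html.fr," depending on what the client asks for.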
Because the language mapping suffixes and the content-type suffixes share the same namespace, you can mix them around. "index.fr.html" is the same as "index.html.fr." Just make sure that you reference it with the correct negotiable resource.
As-Is Files

Often, you might like to send specific HTTP headers with your documents, such as Expires:, but you don't want to make the page a CGI script. The easiest way is to use the "httpd/send-as-is" magic MIME type.
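A sketch of the corresponding srm.conf line, assuming you want files ending in ".asis" to be sent this way:

```apache
# Send files ending in .asis exactly as stored, headers included
AddType httpd/send-as-is asis
```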
It should also be pointed out that if you have MultiViews turned on, you can add an ".asis" to the end of a file name and none of your links need to be renamed. For example, "foobar.html" can easily become "foobar.html.asis," while you can still refer to it as "foobar.html." One last compelling application of "asis" is being able to do HTTP redirection without needing access to server config files. For example, the following .asis file will redirect people to another location:
Advanced Functionality

Host-Based Access Control

One can control access to the server, or even a subdirectory of the server, based on the hostname, domain, or IP number of the client's machine. This is done by using the directives "allow" and "deny," which can be used together at the same time by using "order." "allow" and "deny" can take multiple hosts:
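A sketch of the "only let a few sites in" case, using the hypothetical domains "host1.com" and "host2.com." Note that no "deny" line appears: with this ordering, a "deny from all" would be evaluated last and would override the "allow" line.

```apache
# Evaluate allow rules first, then deny rules; hosts matching neither are refused
order allow,deny
allow from host1.com host2.com
```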
The "order" directive above tells the server to evaluate the "allow" conditions before the "deny" conditions when determining whether to grant access. Likewise, the "only exclude a couple of sites" case described above can be handled by using:
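As a sketch of that second case, with made-up domain names standing in for the sites to be excluded:

```apache
# Evaluate deny rules first, then allow rules; hosts matching neither are admitted
order deny,allow
deny from badguys.com otherbadguys.org
```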
There is a third argument to "order," called "mutual-failure," in which a condition has to pass both the "allow" and "deny" rules in order to succeed. In other words, it has to appear on the "allow" list, and it must not appear on the "deny" list. For example,
It should be mentioned at this point that protecting resources by hostname is dangerous. It is relatively easy for determined people who control the reverse-DNS mapping for their IP number to spoof any hostname they want. Thus, it is strongly recommended that you use IP numbers to protect anything sensitive. In the same way you can simply list the domain to refer to any machine in that domain, you can also give fragments of IP numbers:
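For instance, a sketch that admits any machine whose address begins with "204.62.129" (the network used in the example later in this section):

```apache
# Matches 204.62.129.1, 204.62.129.2, and so on
allow from 204.62.129
```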
Typically these directives are used within a "<Limit>" container, and even that within a "<Directory>" container, usually in an access.conf configuration file. The following example is a good template for most protections; it protects the directory "/www/htdocs/private" from any host except those in the "204.62.129" IP space.
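A sketch of such a template follows. The choice of GET and POST as the limited methods is an assumption; adjust it to the methods you actually need to protect.

```apache
<Directory /www/htdocs/private>
<Limit GET POST>
# Refuse everyone first, then let the allow rule admit the trusted network
order deny,allow
deny from all
allow from 204.62.129
</Limit>
</Directory>
```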
User Authentication

When you place a resource under "user authentication," you restrict access to it by requiring a name and password. This name and password is kept in a database on the server. This database can take many forms, and in fact Apache modules have been written to access flat file databases, DBM file databases, mSQL databases (a freeware database), Oracle and Sybase databases, and more. This book covers only the flat-file and DBM-format databases.

First, some basic configuration directives. The "AuthName" directive sets the authentication "Realm" for the password-protected pages. The "Realm" is what gets presented to clients when prompted for authentication - "Please enter your name and password for the realm." The "AuthType" directive sets the "authentication type" for the area. In HTTP/1.0 there is only one authentication type, and that is "Basic." HTTP/1.1 will have a few more, such as "MD5." The "AuthUserFile" directive specifies the file that contains a list of names and passwords, one pair per line, where the passwords are encrypted by using the simple Unix crypt() routines. For example:
    <Directory /www/htdocs/protected/>
    AuthName Protected
    AuthType basic
    AuthUserFile /usr/local/etc/httpd/conf/users
    AuthGroupFile /usr/local/etc/httpd/conf/group
    <Limit GET POST>
    require group managers
    </Limit>
    </Directory>

DBM Authentication

Apache can be configured to also use DBM files for faster password and group-membership lookups. To use this, you must have the "mod_auth_dbm" module compiled into the server.

"DBM" files are UNIX file types that implement a fast hashtable lookup, making them ideal for handling large user/password databases. The flat-file system requires parsing the password file for every access until a match is found, potentially going through the entire file before returning a "can't find that user" error. Hash tables, on the other hand, know instantly whether a "key" exists in the database, and what its value is. Different systems implement them slightly differently - some use the "ndbm" libraries, some use the "berkeley db" libraries, but the interface through Apache is exactly the same.

To use a DBM file for the database instead of a regular flat file, you use a different directive, "AuthDBMUserFile" instead of "AuthUserFile." Likewise for the group file - "AuthDBMGroupFile" instead of "AuthGroupFile" is used. Take a look at creating the DBM files. There is a program supplied in the "support" subdirectory of the Apache distribution called "dbmmanage." It is a tool for creating and managing DBM files. The basic syntax is as follows:
So, to add a "value" to a DBM file called "users," one would say:
Groups are done a little bit differently in DBM files. Instead of making the key of the database the group, the key is the user and the value is a comma-separated list of the groups that user is in. For example:
Well, DBM files are pretty weird. They aren't like regular files; they can't be looked at directly. Some systems implement the hash table by keeping the index separate from the data, as in this example with the ".pag" and ".dir" files. On BSD systems, where Berkeley DB is implemented, DBM files are saved with a ".db" suffix. So, one should get used to the idea that the "name" of a DBM file is actually its filename without the suffix. The configuration file snippet now looks something like:
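The following sketch adapts the flat-file example from earlier in this section, swapping in the DBM directives. The paths are carried over from that example and are given without any ".pag," ".dir," or ".db" suffix, as described above.

```apache
<Directory /www/htdocs/protected/>
AuthName Protected
AuthType basic
# DBM file names are given without their on-disk suffix
AuthDBMUserFile /usr/local/etc/httpd/conf/users
AuthDBMGroupFile /usr/local/etc/httpd/conf/group
<Limit GET POST>
require group managers
</Limit>
</Directory>
```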
Virtual Hosts

Apache implements a very clean way of handling "virtual hosts," which is the name for the mechanism for being able to serve more than one "host" on a particular machine. Due to a limitation in HTTP, this is accomplished currently by assigning more than one IP number to a machine, and then having Apache bind differently to those different IP numbers. For example, a Unix box might have 204.122.133.1, 204.122.133.2, and 204.122.133.3 pointing to it, with www.host1.com bound to the first, www.host2.com bound to the second, and www.host3.com bound to the third.

This book will not go into how to configure additional IP addresses for your machine, since that varies completely from platform to platform. Your user manual for the operating system should contain information about configuring additional numbers - this is a standard capability on just about all systems these days. "Virtual hosts" are configured using a container in httpd.conf. They look something like this:
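A sketch of one such container for www.host1.com. The document root, log file names, and administrator address are assumptions made for illustration.

```apache
<VirtualHost www.host1.com>
# Content and contact details specific to this host
DocumentRoot /www/htdocs/host1
ServerName www.host1.com
ServerAdmin webmaster@host1.com
# Separate logs for this host
TransferLog logs/host1-access_log
ErrorLog logs/host1-error_log
</VirtualHost>
```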
Any directives put within the VirtualHost container pertain only to requests made to that hostname. The DocumentRoot points to a directory which (presumably) contains content specifically for www.host1.com. Each virtual host can have its own access log, its own error log, its own derivative of the other logs out there, its own Redirect and Alias directives, its own ServerName and ServerAdmin directives, and more. In fact the only things it cannot support, out of the core set of directives, are: ServerType, UserId, GroupId, StartServers, MaxSpareServers, MinSpareServers, MaxRequestsPerChild, BindAddress, PidFile, TypesConfig, and ServerRoot.

If you plan on running Apache with a large number of virtual hosts, you need to be careful to watch the process limits; for example, some Unix platforms only allow processes to open 64 file descriptors at once. An Apache child will consume one file descriptor per logfile per virtual host, so 32 virtual hosts each with their own transfer and error log would quickly cross that limit. You will notice that you are running into problems of this kind if your error logs start reporting errors like "unable to fork()", or your access logs aren't getting written to at all. Apache does try to call setrlimit() to handle this problem on its own, but the system sometimes prevents it from doing so successfully.
Customized Error Messages

Apache can give customized responses in the event of an error. This is controlled using the "ErrorDocument" directive. The syntax is:
For example:
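A sketch, assuming the directive takes an HTTP status code followed by the document to send in its place. The paths and the host name are hypothetical.

```apache
# Send a local page when a requested document is not found
ErrorDocument 404 /errors/not_found.html

# Send visitors to a page on another server when a script fails
ErrorDocument 500 http://www.host1.com/errors/server_error.html
```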
Assorted "httpd.conf" Settings

There are a couple of last configuration options that fell through the cracks.
BindAddress

At startup, Apache will bind to the port it is specified to bind to, for all IP numbers which the box has available. The "BindAddress" directive can be used to specify only a specific IP address to bind to. Using this, one can run multiple copies of Apache, each serving different virtual hosts, instead of having one daemon which can handle all virtual hosts. This is useful if you want to run two web servers with different system user IDs for security and access control reasons.

For example, let's say you have three IP addresses (1.1.1.1, 1.1.1.2, and 1.1.1.3, with 1.1.1.1 being the primary address for the machine), and you want to run three web servers, yet you want one of them to run as a different user ID than the other two. One would have two sets of configuration files; one would say something like
PidFile

This is the location of the file containing the process-ID for Apache. This file is useful for being able to automate the shutdown or restart of the web server. By default, this is "logs/httpd.pid." For example, one could shut down the server by saying:
Timeout

This directive specifies the amount of time that the server will wait in between packets sent before considering the connection "lost." For example, "1200," the default, means that the server will wait for 20 minutes after sending a packet before it considers the connection dead if no response comes back. Busy servers may wish to turn this down, at the cost of reduced service to low-bandwidth customers.
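As a sketch of turning the value down on a busy server, assuming the argument is given in seconds as the "1200" example implies. The figure of 300 is an arbitrary illustration, not a recommendation.

```apache
# Give up on a connection after 5 minutes of silence instead of 20
Timeout 300
```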