aggregating and analyzing logs from multiple web servers
Written by jlgaddis on October 7, 2007 – 7:47 pm -Over a year ago, we replaced our single web server running IIS5 on Windows 2000 with a pair of HP DL140’s running Apache on RHEL4. These two machines serve up our main web site as well as a number of low-traffic sites, to the world. When the Marketing folks, who handle the web sites, asked if we could generate statistics on the various web sites, I knew it wouldn’t be as simple as if we just had one server. awstats, my log analyzer of choice, works best when it has a single logfile to import. How then, could I aggregate the logfiles from two (or more) servers into a single logfile in chronological order?
For security reasons, I prefer to keep awstats on a separate server and configure Apache (and the firewalls) so that it’s only available from a few subnets. We already had an existing instance of awstats installed on another server from our II5/Win2K web server days, so I decided the easiest method would be to simply update it, reconfigure it, and make it available once again. The aggregated logfile and statistics, therefore, would need to be on a third server where awstats was at. Hmm…
I could have, fairly easily, used a combination of shell scripts and user accounts with SSH keys properly set up in order to transfer the log files to the third system. I’m kinda lazy, though (in the “good sysadmins are lazy” way). Being a huge open source zealot (in a predominantly Microsoft environment, nonetheless), I already have a database server running MySQL supporting some internal (and public) applications. Likewise, I have lots of existing Perl code that I’ve already written to work with MySQL data in a variety of ways. I decided that MySQL would be the solution and Perl would be the glue to hold it all together.
My Apache logs use the default logging format, with two exceptions: the first field of each entry is the virtual hostname, and the referer and user agent are included in the same file (“combined”). My “LogFormat” directive, in httpd.conf, looks like this:
LogFormat "%V %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
A bit farther down in our httpd.conf file, we use the following directive so that we end up with a single logfile with all the data we want, written to logs/accesslog:
CustomLog logs/accesslog combined For the sake of both performance and simplicity, a single logfile is used. Many times you’ll see web servers that use a separate log file per virtual host, which is a performance hit to the server. (Note that I’ve never done any benchmarking to see how much of a performance hit, however.) The methods that follow can easily be adapted to support multiple log files.
If we’re going to be storing our log data in MySQL, we’re going to need a database. In my case, I’ve created a database (“wwwlogs”) that holds a single table (“logs”). The table looks like this:
mysql> DESCRIBE logs; +-----------+--------------+------+-----+---------------------+----------------+ | Field | Type | Null | Key | Default | Extra | +-----------+--------------+------+-----+---------------------+----------------+ | id | bigint(20) | | PRI | NULL | autoincrement | | vhost | varchar(255) | | | | | | clientip | varchar(16) | | | 0.0.0.0 | | | date | datetime | | | 1970-01-01 00:00:00 | | | request | text | | | | | | status | smallint(6) | | | 0 | | | bytes | int(11) | | | 0 | | | referer | text | | | | | | useragent | text | | | | | +-----------+--------------+------+-----+---------------------+----------------+ 9 rows in set (0.00 sec)There’s a few fields in the Apache logfile that we don’t care about, so they’re not even imported into the database which saves us from having to ignore them later. The fields in the table map to the logfile as follows:
- id : a primary key field generated by the database server (which we don’t directly use)
- vhost : the virtual host (e.g. “www.domain.com”)
- clientip : ip address of the client making the request
- date : the date and time of the request
- request : the actual resource (url) requested by the client
- status : the status code returned by the server
- bytes : the number of bytes returned by the server
- referer : the referer provided by the client
- useragent : the user-agent string provided by the client
Shortly after 05:00AM, our Marketing group can navigate to the awstats site and view the statistics from the previous day.
When asked if I could provide statistics from our web sites, I was told that money was available if we needed to purchase a commercial product to provide this information. Once again, however, open source saves the day. This entire setup cost my company nothing, except for the couple of hours I spent setting the whole thing up. Could we have purchased and used a commercial product and gained even more insight into our web site? Absolutely. Since our web folks aren’t (yet) looking for that information, however, there was no need.
Once I clean up the code a bit (removing database credentials and such), I’ll post the scripts I wrote and some more details on the crontab entries. To be quite honest, this probably won’t be a huge priority for me so if you’re interested in getting a copy of the code, please post a comment encouraging me to do so. =)
One last note: like just about everything else, there are a number of different means one could use to come up with the same end result. I chose this path as it was the easiest for me. I’d be interested in hearing from others that have been approached with a similar problem but came up with a different solution (and, perhaps more importantly, how they came to that solution).
Tags: linux, open-source | 4 Comments »




October 7th, 2007 at 8:15 pm
[...] You can read the rest of this blog post by going to the original source, here [...]
October 7th, 2007 at 8:47 pm
I’ve thought about logging directly to mysql using modlogsql but haven’t got around to it. We pull in the logs from multiple web servers after they are rotated and then merge them into a single log. Simple and it’s nice to have the raw logs on hand.
I don’t use awstats but I think it comes with a script to do the merge called logresolvemerge.pl
February 13th, 2008 at 11:48 am
Any chance of posting httpd-logs-to-sql.pl (given the structure of common logs, a straight split() on whitespace won’t give you all the fields…)
February 13th, 2008 at 6:49 pm
PS – tip: IP addresses are big integers, not strings. Store them in their native format and your DB will be happy. http://torque.oncloud8.com/archives/000370.html