Using R for weblog analysis
Dirk Duellmann
Apache Weblog Analysis
Whether you run your own blog or web server or use a hosted service, at some point you may want to know how well your server and your users are doing. Information such as hit frequency, geolocation of users and the distribution of consumed bandwidth is very useful for this and can be obtained in different ways:
- by instrumenting the page running inside the client browser (e.g. Piwik)
- by analysis of the web server logs (e.g. Webalizer)
For the latter I have used Webalizer for several years; it produces nice web-based analysis plots. More recently I moved to a more complicated server environment with several virtual web services and found its configuration and data selection options a bit limiting. Hence I started a toy project to implement the same functionality with a set of simple R scripts, which I will progressively share here.
As a first step, here are some simple examples of data import, cleaning and overview plots. We’ll then add asynchronous IP resolution, add and analyse geolocation information and, as a last step, wrap the analysis output tables and plots into a web application that can be consulted from a remote browser.
data.table vs. dplyr
One of my favourite R packages for data handling, which I will also use here, is the data.table package. Note: most of the results can be obtained in a similar way with the excellent dplyr package, but in some other (larger volume) studies data.table has some performance and memory efficiency advantages, so I’ll stick to it here as well. If you are using R for data handling/aggregating and are not familiar with either package, take a look at both and make your own choice.
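To give a feel for how close the two packages are, here is a minimal sketch that counts hits and sums bytes per status code in both styles. The toy table `hits` and its values are invented for illustration and do not come from a real log; the dplyr part only runs if that package happens to be installed.

```r
library(data.table)

## toy data, invented for illustration only
hits <- data.table(status = c(200, 200, 404, 200, 304),
                   bytes  = c(5120, 2048, 512, 1024, 0))

## data.table: aggregate in the 'j' slot, group with 'by'
dt_res <- hits[, .(n = .N, total_bytes = sum(bytes)), by = status]
print(dt_res)

## dplyr: the same aggregation, skipped if dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dp_res <- dplyr::summarise(dplyr::group_by(as.data.frame(hits), status),
                             n = dplyr::n(), total_bytes = sum(bytes))
  print(dp_res)
}
```

Both variants should give one row per status code; which syntax reads better is largely a matter of taste.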
Importing the logs into R
Well, this part is rather simple, since Apache logs can be read via the standard read.table function:
library(data.table)
## read the complete log - your file name is likely different
w <- data.table(read.table("/var/log/apache2/access_log"))
## There are a few different log types, which vary in the number and sequence
## of log items. Have a look at the Apache configuration or just the file.
## In my case I get a so-called 'combinedvhost' file, which lists the website
## (one of several virtual sites on the same server) in the first column
## and the client host which accessed the server in the second.
## There is a good chance that your server config omits the first field,
## so you may try to drop the 'vhost' string below.
setnames(w,c('vhost','host','ident','authuser','date','tz','request','status','bytes','refer','agent'))
## try the following command to see if data and field names match:
summary(w)
## btw: already this summary shows a lot of interesting info
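If you want to try the import without touching a real log file, the following sketch may help. It assumes the same eleven-field layout as above; the two log lines (hosts, paths, user name, agents) are made up for illustration. It also shows that read.table splits the bracketed timestamp at the blank, which is why there are separate 'date' and 'tz' columns that still carry the brackets.

```r
library(data.table)

## two invented log lines in the vhost-prefixed layout described above
sample_log <- c(
  'www.example.org 192.0.2.10 - - [10/Oct/2015:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
  'blog.example.org 198.51.100.7 - alice [10/Oct/2015:13:56:01 +0200] "GET /missing HTTP/1.1" 404 512 "http://www.example.org/" "curl/7.43.0"'
)

## read from the character vector instead of a file
w <- data.table(read.table(text = sample_log, stringsAsFactors = FALSE))
setnames(w, c('vhost','host','ident','authuser','date','tz','request',
              'status','bytes','refer','agent'))

## 'date' holds "[10/Oct/2015:13:55:36" and 'tz' holds "+0200]";
## strip the leading and trailing bracket before further use
w[, date := sub("^\\[", "", date)]
w[, tz   := sub("\\]$", "", tz)]

str(w)
```

The quoted request, referrer and agent strings each end up in a single column because read.table honours double quotes by default.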