Monthly Archives: January 2015

Cached, asynchronous IP resolving

Resolving IP addresses to host names is quite helpful for getting a quick overview of who is connecting from where. This needs some care in order not to put too much strain on your DNS server with a large number of repeated lookups. Also you may not want to wait for timeouts on IPs that do not resolve. R itself does not support this specifically, but it can easily exploit asynchronous DNS lookup tools like adns (on OSX available from homebrew) and provide a cache to speed things up. Here is a simple example of a vectorised lookup using a data.table as a persistent cache.

library(data.table)

## this basic async lookup is a modified version of an idea described in
## http://rud.is/b/2013/08/12/reverse-ip-address-lookups-with-r-from-simple-to-bulkasynchronous/
ip.to.host <- function(ips) {
  ## store ip list in a temp file
  tf <- tempfile()
  cat(ips, sep='\n', file=tf)
  ## use the adns filter to resolve them asynchronously (see man page for timeouts and other options)
  host.names <- system(paste("adnsresfilter <", tf), intern=TRUE, ignore.stderr=TRUE)
  ## cleanup the temp file
  file.remove(tf)
  return(host.names)
}
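
## quick check of the helper (purely illustrative; requires the adns tools to
## be installed and on the PATH). adnsresfilter passes addresses it cannot
## resolve through unchanged.
ip.to.host(c("127.0.0.1", "8.8.8.8"))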

## now extend the above to implement an IP-to-name cache
ip.cached.lookup <- function(ips, reset.cache=FALSE) {
  cache.file <- "~/.ip.cache.rda"

  ## if the cache file exists: load it
  if (!reset.cache && file.access(cache.file, 4) == 0) {
      load(cache.file)
      message("ip cache entries loaded: ", nrow(host))
  } else {
      ## create an empty table (with just localhost)
      host <- data.table(hip="127.0.0.1", hname="localhost")
  }

  ## prepare a table of query ip and name
  qh <- data.table(hip=as.character(ips),hname=NA)

  ## keep them sorted by ip to speedup data.table lookups
  setkey(host,hip)

  ## resolve all known host name from the cache
  qh$hname <- host[qh]$hname

  ## collect the list of unique ips which did not get resolved yet
  new.ips <- unique(qh[is.na(qh$hname)]$hip)

  ## if not empty, resolve the rest
  if (length(new.ips) > 0) {
    ## add the new ips to the cache table
    host <- rbind(host, list(hip=new.ips,hname=NA))

    ## find locations which need resolution (either new or expired)
    need.resolving <- is.na(host$hname)

    message("new ips to resolve: ", sum(need.resolving))
    ## and resolve them
    host$hname[need.resolving] <- ip.to.host(host[need.resolving]$hip)

    ## need to set key again after rbind above..
    setkey(host,hip)

    ## .. to do the remaining lookups
    qh$hname <- host[qh]$hname

    ## save the new cache status
    save(host, file = cache.file)
  }

  return(qh$hname)
}

## with this function you can easily add a host.name column to your
## weblog data.table from the previous posts to get started with
## the real log analysis

w$host.name <- ip.cached.lookup(w$host)

Getting hold of remote weblogs


The last post assumed that the weblogs to analyse are directly accessible by the R session, which may not be the case if your analysis is running on a machine other than the web server. Also, in some cases you may want to filter out uninteresting log records (eg local clients on the web server or local area accesses from known clients). The next examples show how to modify the previous R script using R's pipe function to take this into account:

## read the last 100K log entries from svr via a ssh connection
## (this assumes you have setup the ssh keys correctly beforehand)
w <- data.table(read.table(pipe("ssh svr 'tail -n 100000 /var/log/apache2/access_log'")))

## in addition filter out all accesses from local clients on the web
## server or the local subnet (in this case 192.168.10.xxx)

w <- data.table(read.table(pipe("ssh svr 'tail -n 100000 /var/log/apache2/access_log | awk \"\\$2 !~ /127\\.0\\.0\\.1|192\\.168\\.10\\./\"'")))
## note: the proper quoting/escaping of R and shell strings on this one takes
## more effort than the processing. There must be an R function which does this...
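
As an aside, base R's shQuote() can take care of one layer of that quoting. The following sketch rebuilds the same command with it; this is just one possible approach and assumes a POSIX shell both locally and on the server:

## build the remote command: the awk expression is single-quoted for the
## remote shell by shQuote(), and the whole remote command is then quoted
## again for the local shell started by pipe()
awk.expr   <- "$2 !~ /127\\.0\\.0\\.1|192\\.168\\.10\\./"
remote.cmd <- paste("tail -n 100000 /var/log/apache2/access_log | awk",
                    shQuote(awk.expr))
w <- data.table(read.table(pipe(paste("ssh svr", shQuote(remote.cmd)))))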

In a similar way you could concatenate multiple (eg already logrotated) logs and/or unzip log files, as sketched below. As this pre-filtering takes place on the server machine holding the log files, it helps to bring down the amount of data to be transferred and analysed: always a good start to avoid the popular ‘unnecessarily big data’ syndrome…
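For example, a sketch of pulling in the current log together with the already rotated, gzipped ones (the file name pattern is an assumption about a typical logrotate setup; adjust it to your server):

## concatenate the live log with the gzip'ed rotated logs on the server side;
## zcat decompresses them on the fly before the data is sent over ssh
w <- data.table(read.table(pipe(
  "ssh svr 'cat /var/log/apache2/access_log; zcat /var/log/apache2/access_log.*.gz'")))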

Using R for weblog analysis

Apache Weblog Analysis

Whether you run your own blog or web server or use some hosted service – at some point you may be interested in how well your server or your users are doing. Information such as hit frequency, geolocation of users and the distribution of spent bandwidth is very useful for this and can be obtained in different ways:

  • by instrumenting the page running inside the client browser (eg piwik)
  • by analysis of the web server logs (eg webalizer)

For the latter I have been using webalizer for several years, which produces nice web-based analysis plots. More recently I moved to a more complicated server environment with several virtual web services, and I found the configuration and data selection options a bit limiting. Hence, as a toy project, I started to implement the same functionality with a set of simple R scripts, which I will progressively share here.

As a first step, here are some simple examples of the data import, cleaning and overview plots. We’ll then add asynchronous IP resolution, add and analyse geolocation information and, as a last step, wrap the analysis output tables and plots into a web application which can be consulted from a remote browser.

data.table vs. dplyr

One of my favourite R packages for data handling, which I will also use here, is the ‘data.table’ package. Note: most of the results can be obtained in a similar way using the excellent ‘dplyr’ package, but for some of my other (larger volume) projects data.table has some performance and memory efficiency advantages, so I’ll stick to data.table. If you are using R for data handling/aggregating and are not familiar with either package – take a look at both and make your own choice.
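
Just to illustrate the flavour of the two syntaxes, here is a tiny, self-contained comparison on toy data (the object names are made up for this sketch and are not part of the log analysis itself):

library(data.table)
library(dplyr)

## toy data: a few requests from two hosts
d <- data.frame(host=c("a", "b", "a"), bytes=c(100, 250, 50))

## data.table: hits and total bytes per host
dt <- as.data.table(d)
dt[, .(hits=.N, total.bytes=sum(bytes)), by=host]

## the same aggregation with dplyr
d %>% group_by(host) %>% summarise(hits=n(), total.bytes=sum(bytes))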

Importing the logs into R

Well, this part is rather simple since Apache logs can be read via the standard read.table function:

library(data.table)
## read the complete log - your file name is likely different
w <- data.table(read.table("/var/log/apache2/access_log"))

## there are a few different log formats which vary in the number and sequence
## of log items. Have a look at the apache configuration or just at the file.
## In my case I get a so-called 'combinedvhost' format which lists in the
## first column the website (out of several virtual sites on the same server)
## and in the second column the client host which accessed the server.
## There is a good chance that your server config omits the first field,
## so you may try to drop the 'vhost' string below.

setnames(w,c('vhost','host','ident','authuser','date','tz','request','status','bytes','refer','agent'))

## try the following command to see if data and field names match:
summary(w)
## btw: already this summary shows a lot of interesting info
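
As a small teaser for the cleaning step, one might convert the timestamp and byte count into proper types right away. A sketch, assuming the standard Apache timestamp split across the 'date' and 'tz' fields as above and an English locale for the month abbreviations:

## read.table splits the timestamp into 'date' ("[10/Jan/2015:12:34:56") and
## 'tz' ("+0100]"); parse the date part into POSIXct
w$time <- as.POSIXct(strptime(as.character(w$date), format="[%d/%b/%Y:%H:%M:%S"))

## bytes may contain '-' for requests without a body; turn those into NA
w$bytes <- suppressWarnings(as.numeric(as.character(w$bytes)))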