Getting hold of remote weblogs

The last post was assuming that the weblogs to analyse are directly accessible by the R session which may not be the case if your analysis is running on a remote machine. Also in some cases you may want to filter out some uninteresting log records (eg local clients on the web server or local area accesses from known clients). The next examples show how to modify the previous R script using the R pipe function to take this into account:

## read the last 100K log entries from svr via a ssh connection
## (this assumes you have setup the ssh keys correctly beforehand)
w <- data.table(read.table(pipe("ssh svr 'tail -n 100000 /var/log/apache2/access_log'")))

## in addition filter out all accesses from local clients on the web
## server or the local subnet (in this case 192.168.10.xxx)

w <- data.table(read.table(pipe("ssh svr 'tail -n 100000 /var/log/apache2/access_log | awk \"\\$2 !~ /127\\.0\\.0\\.1|192\\.168\\.10\\./\"'")))
## note: the proper quoting/escaping of R and shell strings on this one takes
## more effort than the processing. There must be an R function which does this...

In a similar way you could concatenate multiple (eg already logrotated) logs and/or unzip logfiles. As this pre-filtering takes place locally on the server machine holding the log files this helps to bring down the data amount to be transfered and analysed: always a good start to avoid the popular ‘unecessarily big data’ syndrome…