Posts on analytx.info

Analytx site moved to Hugo

Sat, 30 Jan 2021 11:03:00 +0100

Looking at my web logs, and being scared into regularly updating WordPress, its modules and its back-end database, has convinced me that I do not need all that complexity for a handful of posts per year. Despite the excellent feedback I have received (sadly, mostly from WordPress comment spammers), I decided to simplify and move this small site to the static site generator Hugo, which I already use for my other sites.

Monty Hall problem - a small simulation in R

Sun, 12 Nov 2017 12:32:00 +0100

The Monty Hall problem is an interesting example of how badly intuition can mislead us in some statistical contexts. Even more disturbing, though, is how long we are prepared to debate and defend an expected result before actually checking our initial guesses with a simple Monte Carlo simulation. Here is a simple simulation of the Monty Hall game show problem: In the TV show “Let’s Make a Deal” the host Monty Hall would offer the participant a choice of three doors.
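A Monte Carlo check of this kind could look roughly like the sketch below. It is not the code from the original post: the function names `play_once` and `simulate_monty_hall`, the number of games and the seed are all assumptions made for illustration. Each game places the prize at random, lets the contestant pick a door, has the host open a losing door that was not picked, and records whether staying or switching would have won.

```r
# Play a single game and report whether each strategy wins.
play_once <- function() {
  doors  <- 1:3
  prize  <- sample(doors, 1)   # door hiding the prize
  choice <- sample(doors, 1)   # contestant's first pick

  # The host opens a door that is neither the pick nor the prize.
  candidates <- setdiff(doors, c(prize, choice))
  # Guard: sample(x, 1) on a single number x would sample from 1:x.
  opened <- if (length(candidates) == 1) candidates else sample(candidates, 1)

  # Switching means taking the one remaining closed door.
  switched <- setdiff(doors, c(choice, opened))

  c(stay = choice == prize, switch = switched == prize)
}

# Repeat the game n times and return the win rate of each strategy.
simulate_monty_hall <- function(n = 10000, seed = 1) {
  set.seed(seed)
  results <- replicate(n, play_once())   # 2 x n logical matrix
  rowMeans(results)
}

res <- simulate_monty_hall()
print(res)
```

With enough games the win rate for staying should settle near 1/3 and the win rate for switching near 2/3, which is the result intuition tends to reject.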

Slides and blog posts with R and emacs org-mode

Sun, 04 Jun 2017 12:59:00 +0200

Preparing a larger number of slides with R code and plots can be a bit tedious with standard desktop presentation software like PowerPoint or Keynote. The manual effort to change the example code, run the analysis, and then cut and paste updated graphs, tables and code is high. Sooner or later one is bound to create inconsistencies between code and expected results or even syntax errors

Using Data Frames in Feather format (Apache Arrow)

Thu, 07 Apr 2016 18:43:00 +0200

Triggered by the RStudio blog article about feather I did the one line install and compared the results on a data frame of 19 million rows. First results look indeed promising:
# build the package
> devtools::install_github("wesm/feather/R")
# load an existing data frame (19 million rows with batch job execution results)
> load("batch-12-2015.

Setting up an RStudio server for iPad access

Tue, 05 Apr 2016 00:00:00 +0200

Sometimes it can be convenient to run RStudio remotely from an iPad or another machine with little RAM or disk space. This can be done quite easily using the free RStudio Server on OSX via Docker. To do this: Find the rocker/rstudio image on Docker Hub and follow the setup steps here at GitHub. Once the image is running, you should be able to connect with Safari on the host Mac to the login page eg at

Cached, asynchronous IP resolution

Thu, 15 Jan 2015 19:39:00 +0100

Resolving IP addresses to host names is quite helpful for getting a quick overview of who is connecting from where. This needs some care so as not to put too much strain on your DNS server with a large number of repeated lookups. You may also not want to wait for timeouts on IPs that do not resolve. R itself does not support this directly, but it can easily exploit asynchronous DNS lookup tools like adns (on OSX from homebrew) and provide a cache to speed things up.
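The caching half of this idea can be sketched independently of the lookup tool. The example below is an illustration under stated assumptions, not the post's implementation: `make_cached_resolver` and `fake_resolve` are hypothetical names, and `fake_resolve` is a stand-in for whatever batch lookup you use (for example a wrapper around an adns command line tool, which is deliberately not shown here). The resolver is assumed to take a character vector of IPs and return host names in the same order, with NA for addresses that do not resolve.

```r
# Wrap a batch resolver so that every IP is looked up at most once.
make_cached_resolver <- function(resolve_fun) {
  cache <- character(0)   # named vector: names are IPs, values are host names
  function(ips) {
    unknown <- setdiff(unique(ips), names(cache))
    if (length(unknown) > 0) {
      # One batched lookup for all IPs not seen before.
      # Failed lookups (NA) are cached too, so they are not retried.
      cache[unknown] <<- resolve_fun(unknown)
    }
    unname(cache[ips])
  }
}

# Stand-in resolver (assumption): a fixed table instead of real DNS,
# with a counter to show how many addresses were actually looked up.
lookups <- 0
fake_resolve <- function(ips) {
  lookups <<- lookups + length(ips)
  known <- c("192.0.2.1" = "host-a.example.org",
             "192.0.2.2" = "host-b.example.org")
  unname(known[ips])      # NA for IPs that do not resolve
}

resolve <- make_cached_resolver(fake_resolve)
first  <- resolve(c("192.0.2.1", "192.0.2.2", "192.0.2.1", "198.51.100.9"))
second <- resolve(c("192.0.2.2", "198.51.100.9"))
print(first)
print(second)
print(lookups)
```

Because the cache lives in the closure, it only lasts for the R session; saving it to disk between runs would be a natural extension.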

Using R for weblog analysis

Thu, 15 Jan 2015 19:39:00 +0100

Apache Weblog Analysis
Whether you run your own blog or web server or use some hosted service, at some point you may want to know how well your server or your users are doing. Details such as hit frequency, geolocation of users and distribution of spent bandwidth are very useful for this and can be obtained in two ways: by instrumenting the page running inside the client browser (eg piwik), or by analysing the web server logs (eg webalizer). For the latter I have used webalizer for several years; it produces nice web-based analysis plots.

Getting hold of remote weblogs

Tue, 13 Jan 2015 01:25:00 +0100

The last post assumed that the weblogs to analyse are directly accessible by the R session, which may not be the case if your analysis is running on a remote machine. Also, in some cases you may want to filter out some uninteresting log records (eg local clients on the web server or local area accesses from known clients). The next examples show how to modify the previous R script using the R pipe function to take this into account:
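As a rough sketch of the approach (not the script from the original post), base R's `pipe()` can read the output of a shell command as if it were a file, so fetching and filtering can both happen before the data reaches R. The helper name `read_filtered_log`, the sample log lines and the exclusion pattern are assumptions for illustration. For a remote log, `fetch_cmd` would be something like an ssh command that prints the log to stdout; here a local `cat` on a temporary file stands in for it so the example runs anywhere with a Unix shell and grep.

```r
# Stream a log through a shell pipeline and drop unwanted records.
# fetch_cmd: any command that writes the log to stdout
# exclude_pattern: extended regular expression for lines to discard
read_filtered_log <- function(fetch_cmd, exclude_pattern) {
  cmd <- paste(fetch_cmd, "| grep -v -E", shQuote(exclude_pattern))
  con <- pipe(cmd, open = "r")
  on.exit(close(con))
  readLines(con)
}

# Local demonstration: a temporary file stands in for the remote weblog.
demo_log <- tempfile(fileext = ".log")
writeLines(c(
  '192.168.1.10 - - [13/Jan/2015:01:00:00 +0100] "GET / HTTP/1.1" 200 512',
  '203.0.113.7 - - [13/Jan/2015:01:00:05 +0100] "GET /post HTTP/1.1" 200 2048',
  '127.0.0.1 - - [13/Jan/2015:01:00:09 +0100] "GET /status HTTP/1.1" 200 64'
), demo_log)

# Drop requests from localhost and from the local 192.168.x.x network.
records <- read_filtered_log(paste("cat", shQuote(demo_log)),
                             "^(127\\.0\\.0\\.1|192\\.168\\.)")
print(records)
```

Filtering on the shell side keeps uninteresting records from ever being transferred into the R session, which matters more as the logs grow.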

Hadoop + kerberos with RStudio

Tue, 02 Dec 2014 18:23:00 +0100

While the setup from the previous posts works for the hadoop shell commands, you will still fail to access the remote cluster from GUI programs (eg RStudio) and/or with hadoop plugins like RHadoop. There are two reasons for that:
1. GUI programs do not inherit your terminal/shell environment variables, unless you start them from a terminal session with
   $ open /Applications/RStudio.app
2. $HADOOP_OPTS / $YARN_OPTS are not evaluated by other programs, even if the variables are present in their execution environment.

Connect to a remote, kerberized hadoop cluster

Tue, 02 Dec 2014 10:28:00 +0100

To use a remote hadoop cluster with kerberos authentication you will need to get a proper krb5.conf file (eg from your remote cluster's /etc/krb5.conf) and place it at /etc/krb5.conf on your client OSX machine. To use this configuration from your OSX hadoop client, change your .[z]profile to:
export HADOOP_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
export YARN_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
With Java 1.7 this should be sufficient to detect the default realm, the kdc and also any specific authentication options used by your site.

Basic set-up of hadoop on OSX yosemite

Tue, 18 Nov 2014 21:49:00 +0100

Why would I do this? An OSX laptop will not allow any larger-scale data processing, but it may be a convenient place to develop/debug hadoop scripts before running them on a real cluster. For this you likely want a local hadoop “cluster” to play with, and to use the local commands as a client for a larger remote hadoop cluster. This post covers the local install and basic testing. A second post shows how to extend the setup for accessing/processing against a remote kerberized cluster.

Hugo and org: experience some time after migration

Mon, 01 Jan 0001 00:00:00 +0000

I have now converted all my WordPress- and Drupal-based blog sites to Hugo with Org format as input. Doom Emacs has simplified this process, as the standard config is already convenient to use and requires only minor tweaks (eg in case you would like to keep a few unpublished blog entries in a common org file).