Monty Hall problem – a small simulation in R

The Monty Hall problem is an interesting example of how much intuition can mislead us in some statistical contexts. Even more disturbing, though, is how long we are prepared to debate and defend an expected result before actually checking our initial guess with a simple Monte Carlo simulation.

Here is a simple simulation of the Monty Hall game show problem; but first, a quick recap of the setting:

In the TV show “Let’s Make a Deal” the host Monty Hall would offer a participant the choice of three doors. One of them hid a valuable prize (eg a car) – behind the other two doors were only less desirable goats.

After the participant has chosen a door, which is kept closed for now, Monty would open one of the other doors to show that it did not hide the prize. Once that door was open, the participant was asked whether they wanted to stick with their initial choice or rather change their mind and swap to the other of the two still closed doors.

It turns out that swapping the initially chosen door is in fact the better strategy in this game, which many consider counter-intuitive. The public discussion of this problem in a print magazine revealed a lot about the excitement and the sometimes rather unscientific culture of debating a “scientific” question via public media. And Twitter was not even available yet.


Here is a small Monte Carlo toy in R to compare the two game strategies. It mainly uses two R features:

  • the sample() function to pick randomly – here from a vector of doors.
  • vectors indexed with negative indices, which evaluate to all but the “subtracted” elements

doors     <- 1:3  # 3 doors named 1 to 3
N         <- 1000 # we play N times

swap_wins <- 0    # set win counter to 0

for (i in 1:N) {
  prize  <- sample(doors,1)            # randomly pick a door for the prize
  player <- sample(doors,1)            # player randomly picks a door
  others <- doors[-c(player,prize)]    # doors left for monty, ie the prize and
                                       # player doors removed
  monty  <- others[sample(length(others),1)]  # monty picks one of them
                                       # (note: sample(others,1) would misbehave
                                       #  when only a single door is left)

  swap_wins <- (prize != player) + swap_wins  # did the swap strategy win this round?
}                                             # (the prize was not behind the initial choice)

message("fraction of swap wins: ", round(swap_wins/N,3))

The result of around 0.66 confirms that swapping doors after Monty has opened one of the others is indeed the better strategy; sticking to the initial choice succeeds with a probability of only 1/3.

Such a simulation is of course neither the only nor the most elegant way to arrive at this result. But in case of doubt it is definitely easier to run a quick test than to embark on a public discussion. At the time this involved the exchange of some 10,000 reader letters in the US and proportional excitement in a similar thread in Germany. For more details check out the wikipedia article here or a small book about this topic (in German) here.

PS: Did you notice that the simulation does not actually use the code that determines Monty’s pick? A hint towards an analytical solution.
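Following that hint, the whole loop collapses into a short vectorised sketch (not part of the original script): the swap strategy wins exactly when the player’s first pick misses the prize.

N      <- 100000
prize  <- sample(1:3, N, replace = TRUE)  # prize doors for N games
player <- sample(1:3, N, replace = TRUE)  # initial player picks
mean(prize != player)                     # fraction of swap wins, about 2/3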

Slides and blog posts with R and emacs org-mode

Preparing a larger number of slides with R code and plots can be a bit tedious with standard desktop presentation software like PowerPoint or Keynote. The manual effort to change the example code, rerun the analysis and then cut and paste the updated graphs, tables and code is high. Sooner or later one is bound to create inconsistencies between code and shown results, or even syntax errors.

Using emacs and its swiss army knife org-mode there is a more elegant and reproducible solution: export the consistent code and output of an org-mode code block to generate either slides or blog pages (or both) from a single source. As a free benefit, this tool chain can also produce high quality PDF handouts with a managed table of contents and an index.

Org Mode source blocks

Here is a standard statistics “hello world” example using ggplot2:

library(ggplot2)

df <- data.frame( norm = rnorm(10000))
ggplot(df, aes(x = norm)) + geom_histogram()

(resulting plot p.png: a histogram of the 10,000 normal draws)
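For illustration, here is roughly how such a snippet sits in an org-mode document (a minimal sketch; the exact header arguments, eg :results graphics file, depend on the org-mode version and export backend). Exporting the file, or hitting C-c C-c inside the block, runs the code and links the generated plot into the slide or blog page:

#+BEGIN_SRC R :exports both :results graphics file :file p.png
  library(ggplot2)

  df <- data.frame( norm = rnorm(10000))
  ggplot(df, aes(x = norm)) + geom_histogram()
#+END_SRC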


Using Data Frames in Feather format (Apache Arrow)

Triggered by the RStudio blog article about feather I did the one-line install and compared the results on a data frame of 19 million rows. The first results do look promising:

# build the package
> devtools::install_github("wesm/feather/R")

# load an existing data frame (19 million rows with batch job execution results)
> load("batch-12-2015.rda")

# write it in feather format...
> write_feather(dt,"batch-12-2015.feather")

# ... which is not compressed, hence larger on disk
> system("ls -lh batch-12-2015.*")
-rw-r--r-- 1 dirkd staff 813M 7 Apr 11:35 batch-12-2015.feather
-rw-r--r-- 1 dirkd staff 248M 27 Jan 22:42 batch-12-2015.rda

# a few repeat reads on an older macbook with sdd
> system.time(load("batch-12-2015.rda"))
user system elapsed
8.984 0.332 9.331
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.103 1.094 7.978
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.045 0.352 9.418
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.110 0.658 3.997
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.009 0.356 9.393
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.099 0.711 4.548

So, around half the elapsed time and about 1/10th of the user CPU time (uncompressed)! Of course these measurements come from the file system cache rather than from the laptop SSD, but the reduction in wall time is nice for larger volume loads.

More important though is the cross-language support for R, Python, Scala/Spark and others, which could make feather the obvious exchange format within a team or between workflow steps that use different implementation languages.
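One small practical note (an assumption based on the current feather R package, whose read_feather() returns a plain data frame/tibble rather than a data.table): convert explicitly after reading if later steps rely on data.table semantics.

library(feather)
library(data.table)

## read_feather() hands back a data frame / tibble, so convert it
## explicitly when the downstream code expects a data.table
dt1 <- as.data.table(read_feather("batch-12-2015.feather"))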

Setting up an RStudio server for iPad access

Sometimes it can be convenient to use RStudio remotely from an iPad or another machine with little RAM or disk space. This can be done quite easily by running the free RStudio Server on OSX via docker. To do this:

  • Find the rocker/rstudio image on docker hub and follow the setup steps here [github].
  • Once the image is running, you should be able to connect to the login page with Safari on the host Mac, eg at
    $ open http://192.168.99.100:8787
    
  • Now only one small last step is needed: expose the server port from the host on the local network using the OSX firewall. In the somewhat explicit language of the “new” OSX firewall (pf) this can be done using:

    $ echo "rdr pass inet proto tcp from any to any port 8787 -> 127.0.0.1 port 8787" | sudo pfctl -ef -
    

    At this point you should be able to connect remotely from your iPad to

    http://<main-mac-ip-or-name>:8787
    

    and continue your R session where you left it earlier, eg on your main machine.

    BTW: If your network cannot be trusted then you should probably change the default login credentials as described in the image docs.

    Cached, asynchronous IP resolving

    Resolving IP addresses to host names is quite helpful for getting a quick overview of who is connecting from where. This may need some care to avoid putting too much strain on your DNS server with a large number of repeated lookups. Also you may not want to wait for timeouts on IPs that do not resolve. R itself does not support this specifically, but it can easily exploit asynchronous DNS lookup tools like adns (on OSX available from homebrew) and provide a cache to speed things up. Here is a simple example for a vectorised lookup using a data.table as persistent cache.

    library(data.table)
    
    ## this basic async lookup is a modified version of an idea described in
    ## http://rud.is/b/2013/08/12/reverse-ip-address-lookups-with-r-from-simple-to-bulkasynchronous/
    ip.to.host <- function(ips) {
      ## store ip list in a temp file
      tf <- tempfile()
      cat(ips, sep='\n', file=tf)
      ## use the adns filter to resolve them asynchronously (see man page for timeouts and other options)
      host.names <- system(paste("adnsresfilter <", tf) ,intern=TRUE, ignore.stderr=TRUE)
      ## cleanup the temp file
      file.remove(tf)
      return(host.names)
    }
    
    ## now extend the above to implement an ip-to-name cache
    ip.cached.lookup <- function(ips, reset.cache=FALSE) {
      cache.file <- "~/.ip.cache.rda"
    
      ## if the cache file exists: load it
      if (!reset.cache & !file.access(cache.file,4)){
          load(cache.file)
          message("ip cache entries loaded :", nrow(host))
      } else {
          ## create an empty table (with just localhost)
          host <- data.table(hip="127.0.0.1", hname="localhost")
      }
    
      ## prepare a table of query ip and name
      qh <- data.table(hip=as.character(ips),hname=NA)
    
      ## keep them sorted by ip to speed up data.table lookups
      setkey(host,hip)
    
      ## resolve all known host name from the cache
      qh$hname <- host[qh]$hname
    
      ## collect the list of unique ips which did not get resolved yet
      new.ips <- unique(qh[is.na(qh$hname)]$hip)
    
      ## if not empty, resolve the rest
      if (length(new.ips) > 0) {
        ## add the new ips to the cache table
        host <- rbind(host, list(hip=new.ips,hname=NA))
    
        ## find locations which need resolution (either new or expired)
        need.resolving <- is.na(host$hname)
    
        message("new ips to resolve: ", sum(need.resolving))
        ## and resolve them
        host$hname[need.resolving] <- ip.to.host(host[need.resolving]$hip)
    
        ## need to set key again after rbind above..
        setkey(host,hip)
    
        ## .. to do the remaining lookups
        qh$hname <- host[qh]$hname
    
        ## save the new cache status
        save(host, file = cache.file)
      }
    
      return(qh$hname)
    }
    
    ## with this function you can easily add a host.name column to your
    ## weblog data.table from the previous posts to get started with
    ## the real log analysis
    
    w$host.name <- ip.cached.lookup(w$host)
    

    Getting hold of remote weblogs


    The last post assumed that the weblogs to analyse are directly accessible to the R session, which may not be the case if your analysis runs on a different machine than the web server. Also, in some cases you may want to filter out uninteresting log records (eg local clients on the web server or local area accesses from known clients). The next examples show how to modify the previous R script using the R pipe() function to take this into account:

    ## read the last 100K log entries from svr via a ssh connection
    ## (this assumes you have setup the ssh keys correctly beforehand)
    w <- data.table(read.table(pipe("ssh svr 'tail -n 100000 /var/log/apache2/access_log'")))
    
    ## in addition filter out all accesses from local clients on the web
    ## server or the local subnet (in this case 192.168.10.xxx)
    
    w <- data.table(read.table(pipe("ssh svr 'tail -n 100000 /var/log/apache2/access_log | awk \"\\$2 !~ /127\\.0\\.0\\.1|192\\.168\\.10\\./\"'")))
    ## note: the proper quoting/escaping of R and shell strings on this one takes
    ## more effort than the processing. There must be an R function which does this...
    

    In a similar way you could concatenate multiple (eg already logrotated) logs and/or unzip log files, as sketched below. As this pre-filtering takes place locally on the server machine holding the log files, it helps to bring down the amount of data to be transferred and analysed: always a good start to avoid the popular ‘unnecessarily big data’ syndrome…
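    A sketch of such a variant (assuming the rotated logs are gzip-compressed and live next to the current one; adjust the path pattern to your logrotate settings):

    library(data.table)

    ## concatenate the current and all rotated (gzipped) logs on the server;
    ## gzip -cdf decompresses .gz files and passes plain files through unchanged
    w <- data.table(read.table(pipe(
      "ssh svr 'gzip -cdf /var/log/apache2/access_log*'")))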

    Using R for weblog analysis

    Apache Weblog Analysis

    Whether you run your own blog or web server or use some hosted service – at some point you may be interested in how well your server and your users are doing. Information like hit frequency, geolocation of users and the distribution of spent bandwidth is very useful for this and can be obtained in different ways:

    • by instrumenting the page running inside the client browser (eg piwik)
    • by analysis of the web server logs (eg webalizer)

    For the latter I have been using webalizer for several years, which produces nice web based analysis plots. More recently I moved to a more complicated server environment with several virtual web services and found the configuration and data selection options a bit limiting. Hence, as a toy project, I started to implement the same functionality with a set of simple R scripts, which I will progressively share here.

    As a first step, here are some simple examples for the data import, cleaning and overview plots. We’ll then add asynchronous IP resolution, add and analyse geolocation information and, as a last step, wrap the analysis output tables and plots into a web application which can be consulted from a remote browser.

    data.table vs. dplyr

    One of my favourite R packages for data handling, which I will also use here, is the ‘data.table’ package. Note: most of the results can be obtained in a similar way with the excellent ‘dplyr’ package, but for some of my other (larger volume) projects data.table has performance and memory efficiency advantages, so I’ll stick to data.table. If you are using R for data handling/aggregation and are not familiar with either package – take a look at both (a quick taste of the two syntaxes follows below) and make your own choice.
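    Just to show the flavour of both (a hypothetical example; it assumes the weblog table w with the column names defined in the import section below):

    library(data.table)
    library(dplyr)

    ## data.table syntax: number of hits per client host, largest first
    w[, .(hits = .N), by = host][order(-hits)]

    ## dplyr syntax for the same aggregation
    w %>% group_by(host) %>% summarise(hits = n()) %>% arrange(desc(hits))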

    Importing the logs into R

    Well, this part is rather simple since apache logs can be read via the standard read.table function:

    library(data.table)
    ## read the complete log - your file name is likely different
    w <- data.table(read.table("/var/log/apache2/access_log"))
    
    ## there are a few different log formats which vary in the number and sequence
    ## of log items. Have a look at the apache configuration or just at the file.
    ## In my case I get a so called 'combinedvhost' format which lists as the first
    ## field the website (out of several virtual sites on the same server)
    ## and as the second field the client host which accessed the server.
    ## There is a good chance that your server config omits the first field,
    ## so you may need to drop the 'vhost' name below.
    
    setnames(w,c('vhost','host','ident','authuser','date','tz','request','status','bytes','refer','agent'))
    
    ## try the following command to see if data and field names match:
    summary(w)
    ## btw: already this summary shows a lot of interesting info
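
    As a first overview plot (a minimal sketch; it assumes the column names set above, the usual combined-log timestamp format and an English locale for the month abbreviations), the daily request counts give a quick sanity check of the import:

    library(ggplot2)

    ## parse the apache timestamp (eg "[20/Apr/2016:10:15:32") into POSIXct
    w$time <- as.POSIXct(as.character(w$date), format = "[%d/%b/%Y:%H:%M:%S")

    ## requests per day
    ggplot(w, aes(x = as.Date(time))) + geom_bar() + labs(x = "day", y = "requests")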
    

    hadoop + kerberos with GUI programs (eg RStudio with RHadoop)

    While the setup from the previous posts works for the hadoop shell commands, you will still fail to access the remote cluster from GUI programs (eg RStudio) and/or with hadoop plugins like RHadoop.

    There are two reasons for that:

    • GUI programs do not inherit your terminal/shell environment variables – unless you start them from a terminal session with
    $ open /Applications/RStudio.app
    
    • $HADOOP_OPTS / $YARN_OPTS are not evaluated by other programs even if the variables are present in their execution environment.

    The first problem is well covered by various blog posts. The main difficulty is to find the correct procedure for your OSX version, since Apple has changed the mechanism several times over the years:

    • using a .plist file in ~/.MacOSX (before Mavericks)
    • using a setenv statement in /etc/launchd.conf (Mavericks)
    • using the launchctl setenv command (from Yosemite)

    Finding out which variable is used inside your GUI program or plugin may need some experimentation or a look at the source. For java based plugins the variable _JAVA_OPTIONS, which is always evaluated, may be a starting point. For the RHadoop packages the more specific HADOOP_OPTS is already sufficient, so on Yosemite:

    $ launchctl setenv HADOOP_OPTS "-Djava.security.krb5.conf=/etc/krb5.conf"
    # prefix command with sudo in case you want the setting for all users
    

    If you need the setting only inside R/RStudio you could simply add the environment setting in your R scripts before initialising the RHadoop packages.

    # wrapper script:  hadoop --config ~/remote-hadoop-conf
    hadoop.command <- "~/scripts/remote-hadoop"
    
    Sys.setenv(HADOOP_OPTS ="-Djava.security.krb5.conf=/etc/krb5.conf")
    Sys.setenv(HADOOP_CMD=hadoop.command)
    
    # load hdfs plugin for R
    library(rhdfs)
    hdfs.init()
    
    # print remote hdfs root directory
    print(hdfs.ls("/"))
    

    Connect to a remote, kerberized hadoop cluster

    To use a remote hadoop cluster with kerberos authentication you will need to get a proper krb5.conf file (eg a copy of /etc/krb5.conf from your remote cluster) and place it as /etc/krb5.conf on your client OSX machine. To use this configuration from your OSX hadoop client add the following to your .[z]profile:

    export HADOOP_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
    export YARN_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
    

    With java 1.7 this should be sufficient to detect the default realm, the kdc and also any specific authentication options used by your site. Please make sure the kerberos configuration is already in place when you obtain your ticket with

    $ kinit
    

    In case you obtained a ticket beforehand you may have to execute kinit again or log in to your local account again.

    For the next step you will need to obtain the remote cluster configuration files (eg scp them from the remote cluster to a local directory such as ~/remote-hadoop-conf). The result should be a local copy similar to this:

    $ ls -l  ~/remote-hadoop-conf
    
    total 184
    -rw-r--r--  1 dirkd  staff  4146 Jun 25  2013 capacity-scheduler.xml
    -rw-r--r--  1 dirkd  staff  4381 Oct 21 11:44 core-site.xml
    -rw-r--r--  1 dirkd  staff   253 Aug 21 11:46 dfs.includes
    -rw-r--r--  1 dirkd  staff     0 Jun 25  2013 excludes
    -rw-r--r--  1 dirkd  staff   896 Dec  1 11:44 hadoop-env.sh
    -rw-r--r--  1 dirkd  staff  3251 Aug  5 09:50 hadoop-metrics.properties
    -rw-r--r--  1 dirkd  staff  4214 Oct  7  2013 hadoop-policy.xml
    -rw-r--r--  1 dirkd  staff  7283 Nov  3 16:44 hdfs-site.xml
    -rw-r--r--  1 dirkd  staff  8713 Nov 18 16:26 log4j.properties
    -rw-r--r--  1 dirkd  staff  6112 Nov  5 16:52 mapred-site.xml
    -rw-r--r--  1 dirkd  staff   253 Aug 21 11:46 mapred.includes
    -rw-r--r--  1 dirkd  staff   127 Apr  4  2014 taskcontroller.cfg
    -rw-r--r--  1 dirkd  staff   931 Oct 20 09:44 topology.table.file
    -rw-r--r--  1 dirkd  staff    70 Jul  2 11:52 yarn-env.sh
    -rw-r--r--  1 dirkd  staff  5559 Nov  5 16:52 yarn-site.xml
    

    Then point your hadoop and hdfs commands to this configuration:

    $ hdfs --config ~/remote-hadoop-conf dfs -ls /
    

    If all worked well, you should at this point see the contents of the remote hdfs root directory and be ready to use the standard hdfs or hadoop commands remotely.

    basic set-up of hadoop on OSX yosemite

    Why would I do this?

    An OSX laptop will not allow any larger scale data processing, but it may be a convenient place to develop and debug hadoop scripts before running them on a real cluster. For this you likely want a local hadoop “cluster” to play with, and to use the local commands as a client for a larger remote hadoop cluster. This post covers the local install and basic testing. A second post shows how to extend the setup for accessing and processing against a remote kerberized cluster.

    Getting prepared

    If you don’t yet have java (yosemite does not actually come with it) then the first step is to download the installer from the Oracle download site. Once installed, you should get something like this in a terminal shell:

    $ java -version
    java version "1.7.0_45"
    Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
    Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
    

    If you need to have several java versions installed and want to be able to switch between them, take a look at the nice description here.

    If you don’t yet have the homebrew package manager installed then get it now by following the (one line) installation instructions on http://brew.sh. Homebrew packages live in /usr/local and rarely interfere with other stuff on your machine (unless you ask them to). Install the hadoop package as a normal user using:

    $ brew install hadoop
    

    (At the time of writing I got hadoop 2.5.1)

    BTW: once you start using brew also for other packages, be careful when running brew upgrade. Eg you may want to use brew pin to avoid getting a new hadoop version installed while upgrading other packages.

    Configure

    Next stop: edit a few config files. In .[z]profile you may want to add a few shortcuts to quickly jump to the relevant places or to switch between hadoop and java versions; this is not strictly required to run hadoop, though.

    export JAVA_HOME=$(/usr/libexec/java_home)
    export HADOOP_VERSION=2.5.1
    export HADOOP_BASE=/usr/local/Cellar/hadoop/${HADOOP_VERSION}
    export HADOOP_HOME=$HADOOP_BASE/libexec
    

    Now you should edit a few hadoop files in your hadoop configuration directory:

    cd $HADOOP_HOME/etc/hadoop
    

    in core-site.xml expand the configuration to:

    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    

    In hdfs-site.xml:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
    

    and finally in mapred-site.xml:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9010</value>
      </property>
    </configuration>
    

    Now it’s time to:

    • Initialise hdfs
    $ hadoop namenode -format
    
    • Start hdfs and yarn
    ## start the hadoop daemons (move to launchd plist to do this automatically)
    $ $HADOOP_BASE/sbin/start-dfs.sh
    $ $HADOOP_BASE/sbin/start-yarn.sh
    
    • Test your hdfs setup
    ## show (still empty) homedir in hdfs
    $ hdfs dfs -ls
    ## put some local file
    $ hdfs dfs -put myfile.txt
    ## now we should see the new file
    $ hdfs dfs -ls
    

    Work around an annoying Kerberos realm problem on OSX

    The hadoop setup will at this point likely still complain with a message like Unable to load realm info from SCDynamicStore, which is caused by a java bug on OSX (more details here).

    There are different ways to work around this, depending on whether you just want to get a local hadoop installation going or need your hadoop client to (also) access a remote kerberized hadoop cluster.

    To get java running on the local (non-kerberized) setup, it is sufficient to just add some definitions to $HADOOP_OPTS (and $YARN_OPTS for yarn) in .[z]profile as described in this post.

    The actual hostname probably does not matter too much, as you won’t do an actual kerberos exchange locally, but just get past the flawed “do we know a default realm” check in java.

    In case you are planning to access a kerberized hadoop cluster please continue reading the next post.

    Cleaning up

    Some of the default logging settings make hadoop rather chatty on the console about deprecated configuration keys and other things. On OSX a few of these messages get annoying after a while, as they make it harder to spot real problems. You may want to adjust the log4j settings to mute warnings that you don’t need to see every single time you enter a hadoop command. In $HADOOP_HOME/etc/hadoop/log4j.properties you could add:

    # Logging Threshold
    log4j.threshold=ALL
    # the native libs don't exist for OSX
    log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
    # yes, we'll keep in mind that some things are deprecated
    log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=ERROR