Monty Hall problem - a small simulation in R

The Monty Hall problem is an interesting example of how much intuition can mislead us in some statistical contexts. Even more disturbing is how long we are prepared to debate and defend an expected result before actually checking our initial guesses with a simple Monte Carlo simulation. Here is a simple simulation of the Monty Hall game show problem: In the TV show “Let’s Make a Deal” the host Monty Hall would offer the participant a choice of three doors. One of them hid a valuable prize (eg a car) - behind the other two doors were only two less desirable goats. ...
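
A minimal sketch of such a simulation (not the code from the post): since the host always reveals a goat, switching wins exactly when the first pick was wrong, so both strategies can be compared directly:

# simulate n games and compare the win rate of staying vs. switching
monty_hall <- function(n = 10000) {
  prize  <- sample(1:3, n, replace = TRUE)   # door hiding the car
  choice <- sample(1:3, n, replace = TRUE)   # contestant's first pick
  c(stay   = mean(choice == prize),          # staying wins if the first pick was right
    switch = mean(choice != prize))          # switching wins whenever it was wrong
}
set.seed(42)
monty_hall()   # roughly 1/3 for staying, 2/3 for switching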

November 12, 2017 · Dirk Duellmann

Slides and blog posts with R and emacs org-mode

Preparing a larger number of slides with R code and plots can be a bit tedious with standard desktop presentation software like PowerPoint or Keynote. The manual effort of changing the example code, re-running the analysis and then cutting and pasting updated graphs, tables and code is high. Sooner or later one is bound to introduce inconsistencies between code and expected results, or even syntax errors ...

June 4, 2017 · Dirk Duellmann

Using Data Frames in Feather format (Apache Arrow)

Triggered by the RStudio blog article about feather I did the one line install and compared the results on a data frame of 19 million rows. First results look indeed promising:

# build the package
> devtools::install_github("wesm/feather/R")

# load an existing data frame (19 million rows with batch job execution results)
> load("batch-12-2015.rda")

# write it in feather format...
> write_feather(dt,"batch-12-2015.feather")

# ... which is not compressed, hence larger on disk
> system("ls -lh batch-12-2015.*")
-rw-r--r--  1 dirkd  staff   813M  7 Apr 11:35 batch-12-2015.feather
-rw-r--r--  1 dirkd  staff   248M 27 Jan 22:42 batch-12-2015.rda

# a few repeat reads on an older macbook with ssd
> system.time(load("batch-12-2015.rda"))
   user  system elapsed
  8.984   0.332   9.331
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
   user  system elapsed
  1.103   1.094   7.978
> system.time(load("batch-12-2015.rda"))
   user  system elapsed
  9.045   0.352   9.418
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
   user  system elapsed
  1.110   0.658   3.997
> system.time(load("batch-12-2015.rda"))
   user  system elapsed
  9.009   0.356   9.393
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
   user  system elapsed
  1.099   0.711   4.548

So, around half the elapsed time and about 1/10th of the user cpu time (uncompressed)! Of course these measurements are from the file system cache rather than the laptop SSD, but the reduction in wall time is nice for larger volume loads. ...

April 7, 2016 · Dirk Duellmann

Setting up an RStudio server for iPad access

Sometimes it can be convenient to run RStudio remotely from an iPad or another machine with little RAM or disk space. This can be done quite easily with the free RStudio Server, running on OSX via docker. To do this, find the rocker/rstudio image on Docker Hub and follow the setup steps described at GitHub. Once the image is running, you should be able to connect with Safari on the host Mac to the login page, eg at ...
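
A minimal sketch of the launch step, run from R via system(); the image tag and port mapping are assumptions based on the rocker documentation, not the exact steps from the post:

# pull and start the RStudio Server container, exposing the web UI on port 8787
system("docker run -d -p 8787:8787 rocker/rstudio")
# then point the browser (Safari on the Mac, or the iPad) at http://<host>:8787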

April 5, 2016 · Dirk Duellmann

Cached, asynchronous IP resolution

Resolving IP addresses to host names is quite helpful for getting a quick overview of who is connecting from where. This needs some care, though, to avoid putting too much strain on your DNS server with a large number of repeated lookups. Also, you may not want to wait for timeouts on IPs that do not resolve. R itself does not support this directly, but it can easily exploit asynchronous DNS lookup tools like adns (on OSX available from homebrew) and provide a cache to speed things up. Here is a simple example of a vectorised lookup using a data.table as persistent cache. ...
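
A minimal sketch of the caching idea; the resolver argument is a hypothetical stand-in for the asynchronous adns-based lookup (eg a pipe() to adnshost), not the code from the post:

library(data.table)

# persistent cache, one row per IP, keyed for fast joins
ip_cache <- data.table(ip = character(), host = character(), key = "ip")

# 'resolver' must take a character vector of IPs and return host names
resolve_ips <- function(ips, resolver) {
  miss <- setdiff(unique(ips), ip_cache$ip)
  if (length(miss) > 0) {
    ip_cache <<- rbindlist(list(ip_cache,
                                data.table(ip = miss, host = resolver(miss))))
    setkey(ip_cache, ip)
  }
  ip_cache[J(ips), host]   # vectorised lookup, NA for unresolved addresses
}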

January 15, 2015 · Dirk Duellmann

Using R for weblog analysis

Whether you run your own blog or web server or use some hosted service, at some point you may be interested in how well your server or your users are doing. Information like hit frequency, geolocation of users and distribution of spent bandwidth is very useful for this and can be obtained in different ways: by instrumenting the page inside the client browser (eg piwik), or by analysing the web server logs (eg webalizer). For the latter I have been using webalizer for several years, which produces nice web based analysis plots. More recently I moved to a more complicated server environment with several virtual web services and found the configuration and data selection options a bit limiting. Hence I started, as a toy project, to implement the same functionality with a set of simple R scripts, which I will progressively share here. ...
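
As a first step, a minimal sketch of loading an access log into R; the path and the assumption of Apache's "combined" log format are placeholders, not the actual scripts from the post:

# read one combined-format access log; the bracketed timestamp splits into two fields
log <- read.table("/var/log/apache2/access.log",
                  col.names = c("host", "ident", "user", "time", "tz",
                                "request", "status", "bytes", "referer", "agent"),
                  stringsAsFactors = FALSE, fill = TRUE, comment.char = "")
# quick overview: hits per client host
head(sort(table(log$host), decreasing = TRUE))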

January 15, 2015 · Dirk Duellmann

Getting hold of remote weblogs

The last post assumed that the weblogs to analyse are directly accessible to the R session, which may not be the case if your analysis is running on a remote machine. Also, in some cases you may want to filter out uninteresting log records (eg local clients on the web server or local area accesses from known clients). The next examples show how to modify the previous R script using the R pipe function to take this into account: ...
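
A minimal sketch of the idea; the host name, log path and filter pattern are placeholders, not the configuration from the post:

# stream the log over ssh and drop local-network clients before the data reaches R
con <- pipe("ssh webhost cat /var/log/apache2/access.log | grep -v '^192\\.168\\.'",
            open = "r")
lines <- readLines(con)
close(con)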

January 13, 2015 · Dirk Duellmann

Connect to a remote, kerberized hadoop cluster

To use a remote hadoop cluster with kerberos authentication you will need to get a proper krb5.conf file (eg from /etc/krb5.conf on your remote cluster) and place it as /etc/krb5.conf on your client OSX machine. To use this configuration from your OSX hadoop client, add to your .[z]profile:

export HADOOP_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
export YARN_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"

With java 1.7 this should be sufficient to detect the default realm, the kdc and also any site-specific authentication options. Please make sure the kerberos configuration is already in place when you obtain your ticket with ...

December 2, 2014 · Dirk Duellmann