Category Archives: RStudio

Using Data Frames in Feather format (Apache Arrow)

Triggered by the RStudio blog article about feather I did the one line install and compared the results on a data frame of 19 million rows. First results look indeed promising:

# build the package
> devtools::install_github("wesm/feather/R")

# load an existing data frame (19 million rows with batch job execution results)
> load("batch-12-2015.rda")

# write it in feather format...
> write_feather(dt,"batch-12-2015.feather")

# ... which is not compressed, hence larger on disk
> system("ls -lh batch-12-2015.*")
-rw-r--r-- 1 dirkd staff 813M 7 Apr 11:35 batch-12-2015.feather
-rw-r--r-- 1 dirkd staff 248M 27 Jan 22:42 batch-12-2015.rda

# a few repeat reads on an older macbook with sdd
> system.time(load("batch-12-2015.rda"))
user system elapsed
8.984 0.332 9.331
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.103 1.094 7.978
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.045 0.352 9.418
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.110 0.658 3.997
> system.time(load("batch-12-2015.rda"))
user system elapsed
9.009 0.356 9.393
> system.time(dt1 <- read_feather("batch-12-2015.feather"))
user system elapsed
1.099 0.711 4.548

So, around half the elapsed time and about 1/10th of the user cpu time (uncompressed) ! Of course these measurements are from file system cache rather than the laptop SSD, but the reduction in wall time is nice for larger volume loads.

More important though is the cross-language support for R, Python, Scala/Spark and others, which could make feather the obvious exchange format within a team or between workflow steps that use different implementation languages.

Setting up an RStudio server for iPad access

Sometimes it can be convenient to run RStudio remotely from an iPad or another machine with little RAM or disk space. This can be done quite easily using the free RStudio Server on OSX via docker. To do this:

  • Find the rocker/rstudio image on docker hub and follow the setup steps here [github].
  • Once the image is running, you should be able to connect with Safari on the host Mac to the login page eg at
    $ open http://192.168.99.100:8787
    
  • Now there is is only a small last step needed. You need to expose the server port from the host on the local network using the OSX firewall. In the somewhat explicit language of the “new” OSX firewall this can be done using:

    $ echo "rdr pass inet proto tcp from any to any port 8787 -> 127.0.0.1 port 8787" | sudo pfctl -ef -
    

    At this point you should be able to connect remotely from your iPad to

    http://<main-mac-ip-or-name>:8787
    

    and continue your R session where you left it before eg on your main machine.

    BTW: If your network can not be trusted then you should probably change the default login credentials as described in the image docs.

    hadoop + kerberos with GUI programs (eg RStudio with RHadoop)

    While the setup from the previous posts works for the hadoop shell commands, you will still fail to access the remote cluster from GUI programs (eg RStudio) and/or with hadoop plugins like RHadoop.

    There are two reasons for that:

    • GUI programs do not inherit your terminal/shell enviroment variables – unless you start them from a terminal session with
    $ open /Applications/RStudio.app
    
    • $HADOOP_OPTS / $YARN_OPTS are not evaluated by other programs even if the variables are present in their execution environment.

    The first problem is well covered by various blog posts. The main difficulty is only to find the correct procedure for your OSX version,since Apple has changed several times over the years:

    • using a .plist file in ~ /.MacOS (before Maverics)
    • using a setenv statement line /etc/launchd.conf (Mavericks)
    • using the launchctl setenv command (from Yosemite)

    To find out which variable is used inside your GUI program or plugin may need some experimentation or look at the source. For java based plugins the variable _JAVA_OPTIONS which is always evaluated may be a starting point. For RHadoop package the more specific HADOOP_OPTS is already sufficient, so on yosemite:

    $ launchctl setenv HADOOP_OPTS "-Djava.security.krb5.conf=/etc/krb5.conf"
    # prefix command with sudo in case you want the setting for all users
    

    If you need the setting only inside R/RStudio you could simply add the enviroment setting in your R scripts before initialising the RHadoop packages.

    # wrapper script:  hadoop --config ~/remote-hadoop-conf
    hadoop.command <- "~/scripts/remote-hadoop"
    
    Sys.setenv(HADOOP_OPTS ="-Djava.security.krb5.conf=/etc/krb5.conf")
    Sys.setenv(HADOOP_CMD=hadoop.command)
    
    # load hdfs plugin for R
    library(rhdfs)
    hdfs.init()
    
    # print remote hdfs root directory
    print(hdfs.ls("/"))