Hadoop + Kerberos with RStudio
By Dirk Duellmann
While the setup from the previous posts works for the hadoop shell commands, you will still fail to access the remote cluster from GUI programs (e.g. RStudio) and/or with hadoop plugins like RHadoop.
There are two reasons for that:
- GUI programs do not inherit your terminal/shell environment variables - unless you start them from a terminal session with
$ open /Applications/RStudio.app
- $HADOOP_OPTS / $YARN_OPTS are not evaluated by other programs, even if the variables are present in their execution environment.
The first problem is well covered by various blog posts. The main difficulty is just finding the correct procedure for your OS X version, since Apple has changed the mechanism several times over the years:
- using a .plist file in ~/.MacOSX (before Mavericks)
- using a setenv statement in /etc/launchd.conf (Mavericks)
- using the launchctl setenv command (from Yosemite on)
Finding out which variable is used inside your GUI program or plugin may need some experimentation or a look at the source. For Java-based plugins the variable _JAVA_OPTIONS, which is always evaluated, may be a starting point. For the RHadoop package the more specific HADOOP_OPTS is already sufficient, so on Yosemite:
$ launchctl setenv HADOOP_OPTS "-Djava.security.krb5.conf=/etc/krb5.conf"
# prefix command with sudo in case you want the setting for all users
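You can check that the variable is actually set in the launchd context by querying it back:
$ launchctl getenv HADOOP_OPTS
-Djava.security.krb5.conf=/etc/krb5.conf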
If you need the setting only inside R/RStudio, you can simply add the environment setting in your R script before initialising the RHadoop packages:
# path to a wrapper script that runs: hadoop --config ~/remote-hadoop-conf
hadoop.command <- "~/scripts/remote-hadoop"
# set the Kerberos config and the hadoop executable before loading the plugin
Sys.setenv(HADOOP_OPTS = "-Djava.security.krb5.conf=/etc/krb5.conf")
Sys.setenv(HADOOP_CMD = hadoop.command)
# load the hdfs plugin for R and initialise it
library(rhdfs)
hdfs.init()
# print the remote hdfs root directory
print(hdfs.ls("/"))
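For completeness: the wrapper script referenced in the comment above only needs to point the stock hadoop client at the copied remote configuration. A minimal sketch (assuming the config directory from the previous posts lives in ~/remote-hadoop-conf) could look like:
#!/bin/sh
# ~/scripts/remote-hadoop: run the local hadoop client against the remote cluster config
exec hadoop --config ~/remote-hadoop-conf "$@"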