Category Archives: yosemite

hadoop + kerberos with GUI programs (eg RStudio with RHadoop)

While the setup from the previous posts works for the hadoop shell commands, you will still fail to access the remote cluster from GUI programs (eg RStudio) or from hadoop plugins like RHadoop.

There are two reasons for that:

  • GUI programs do not inherit your terminal/shell environment variables – unless you start them from a terminal session with
$ open /Applications/RStudio.app
  • $HADOOP_OPTS / $YARN_OPTS are not evaluated by other programs even if the variables are present in their execution environment.

The first problem is well covered by various blog posts. The main difficulty is to find the correct procedure for your OSX version, since Apple has changed the mechanism several times over the years:

  • using a .plist file in ~/.MacOSX (before Mavericks)
  • using a setenv statement in /etc/launchd.conf (Mavericks)
  • using the launchctl setenv command (from Yosemite on)

Finding out which variable is used inside your GUI program or plugin may need some experimentation or a look at the source. For java-based plugins the variable _JAVA_OPTIONS, which is always evaluated, may be a starting point. For the RHadoop packages the more specific HADOOP_OPTS is already sufficient, so on yosemite:

$ launchctl setenv HADOOP_OPTS "-Djava.security.krb5.conf=/etc/krb5.conf"
# prefix command with sudo in case you want the setting for all users
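To check that the variable is actually visible to GUI programs started afterwards, you can query the launchd environment. Note that programs which are already running will not pick up the change, so restart RStudio after setting the variable.

$ launchctl getenv HADOOP_OPTS
-Djava.security.krb5.conf=/etc/krb5.conf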

If you need the setting only inside R/RStudio, you can simply add the environment setting in your R scripts before initialising the RHadoop packages.

# wrapper script:  hadoop --config ~/remote-hadoop-conf
hadoop.command <- "~/scripts/remote-hadoop"

Sys.setenv(HADOOP_OPTS ="-Djava.security.krb5.conf=/etc/krb5.conf")
Sys.setenv(HADOOP_CMD=hadoop.command)

# load hdfs plugin for R
library(rhdfs)
hdfs.init()

# print remote hdfs root directory
print(hdfs.ls("/"))

Connect to a remote, kerberized hadoop cluster

To use a remote hadoop cluster with kerberos authentication you will need to get a proper krb5.conf file (eg from your remote cluster's /etc/krb5.conf) and place it as /etc/krb5.conf on your client OSX machine. To use this configuration from your OSX hadoop client, change your .[z]profile to:

export HADOOP_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
export YARN_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"
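For reference, a minimal krb5.conf looks roughly like the sketch below; the realm and KDC hostnames are placeholders, your site's file will contain the real values and possibly additional options:

[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
    }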

With java 1.7 this should be sufficient to detect the default realm, the kdc and also any specific authentication options used by your site. Please make sure the kerberos configuration is already in place when you obtain your ticket with

$ kinit

In case you obtained a ticket beforehand, you may have to execute kinit again or log in to your local account again.
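You can verify that a valid ticket for the expected realm is in place with klist, which should list a krbtgt entry for your default realm together with its expiry time.

$ klist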

For the next step you will need to obtain the remote cluster configuration files, eg by copying them with scp from the remote cluster to a local directory such as ~/remote-hadoop-conf (a sketch follows the listing below). The result should be a local copy similar to this:

$ ls -l  ~/remote-hadoop-conf

total 184
-rw-r--r--  1 dirkd  staff  4146 Jun 25  2013 capacity-scheduler.xml
-rw-r--r--  1 dirkd  staff  4381 Oct 21 11:44 core-site.xml
-rw-r--r--  1 dirkd  staff   253 Aug 21 11:46 dfs.includes
-rw-r--r--  1 dirkd  staff     0 Jun 25  2013 excludes
-rw-r--r--  1 dirkd  staff   896 Dec  1 11:44 hadoop-env.sh
-rw-r--r--  1 dirkd  staff  3251 Aug  5 09:50 hadoop-metrics.properties
-rw-r--r--  1 dirkd  staff  4214 Oct  7  2013 hadoop-policy.xml
-rw-r--r--  1 dirkd  staff  7283 Nov  3 16:44 hdfs-site.xml
-rw-r--r--  1 dirkd  staff  8713 Nov 18 16:26 log4j.properties
-rw-r--r--  1 dirkd  staff  6112 Nov  5 16:52 mapred-site.xml
-rw-r--r--  1 dirkd  staff   253 Aug 21 11:46 mapred.includes
-rw-r--r--  1 dirkd  staff   127 Apr  4  2014 taskcontroller.cfg
-rw-r--r--  1 dirkd  staff   931 Oct 20 09:44 topology.table.file
-rw-r--r--  1 dirkd  staff    70 Jul  2 11:52 yarn-env.sh
-rw-r--r--  1 dirkd  staff  5559 Nov  5 16:52 yarn-site.xml
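The copy itself could, for example, be done with scp; the hostname and the remote configuration path below are placeholders that depend on your cluster setup:

$ mkdir -p ~/remote-hadoop-conf
$ scp namenode.example.com:/etc/hadoop/conf/* ~/remote-hadoop-conf/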

Then point your hadoop and hdfs commands to this configuration:

$ hdfs --config ~/remote-hadoop-conf dfs -ls /

If all worked well, you should now see the contents of the remote hdfs root directory and be ready to use the standard hdfs or hadoop commands remotely.
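If you do not want to repeat the --config option on every call, you can instead export HADOOP_CONF_DIR in your .[z]profile, which makes the remote configuration the default for the hadoop, hdfs and yarn commands:

export HADOOP_CONF_DIR=$HOME/remote-hadoop-conf

$ hdfs dfs -ls /   # now uses the remote configuration by default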

basic set-up of hadoop on OSX yosemite

Why would I do this?

An OSX laptop will not allow you to do any larger-scale data processing, but it may be a convenient place to develop/debug hadoop scripts before running them on a real cluster. For this you likely want to have a local hadoop “cluster” to play with, and use the local commands as a client for a larger remote hadoop cluster. This post covers the local install and basic testing. A second post shows how to extend the setup for accessing and processing data on a remote kerberized cluster.

Getting prepared

If you don’t yet have java (yosemite does not actually come with it), the first step is to download the installer from the Oracle download site. Once installed, you should get something like this in a terminal shell:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

If you need to have several java versions installed and want to be able to switch between them, have a look at the nice description here.

If you don’t yet have the homebrew package manager installed, get it now by following the (one-line) installation instructions on http://brew.sh. Homebrew packages live in /usr/local and rarely interfere with other stuff on your machine (unless you ask them to). Install the hadoop package as a normal user using:

$ brew install hadoop

(At the time of writing I got hadoop 2.5.1)

BTW: Once you start using brew for other packages as well, be careful when running brew upgrade. Eg you may want to use brew pin to avoid getting a new hadoop version installed while doing other package upgrades.
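For example, to hold hadoop at the currently installed version:

$ brew pin hadoop              # exclude hadoop from 'brew upgrade'
$ brew list --versions hadoop  # show the installed version(s)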

Configure

Next stop: edit a few config files. In .[z]profile you may want to add a few shortcuts to quickly jump to the relevant places or to be able to switch between hadoop and java versions, but this is not strictly required to run hadoop.

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_VERSION=2.5.1
export HADOOP_BASE=/usr/local/Cellar/hadoop/${HADOOP_VERSION}
export HADOOP_HOME=$HADOOP_BASE/libexec

Now you should edit a few hadoop files in your hadoop configuration directory:

$ cd $HADOOP_HOME/etc/hadoop

In core-site.xml expand the configuration to:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
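It does not hurt to create the hadoop.tmp.dir location up front (hadoop will normally create it when formatting the namenode, but doing it explicitly avoids permission surprises):

$ mkdir -p /usr/local/Cellar/hadoop/hdfs/tmp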

In hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

and finally in mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>

Now it's time to:

  • Initialise hdfs
$ hadoop namenode -format
  • Start hdfs and yarn (a quick check with jps is sketched after this list)
## start the hadoop daemons (move to launchd plist to do this automatically)
$ $HADOOP_BASE/sbin/start-dfs.sh
$ $HADOOP_BASE/sbin/start-yarn.sh
  • Test your hdfs setup
## show (still empty) homedir in hdfs
$ hdfs dfs -ls
## put some local file
$ hdfs dfs -put myfile.txt
## now we should see the new file
$ hdfs dfs -ls
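After starting the daemons you can check that they are actually running with jps, which lists the running java processes; with the single-node setup above the output should contain entries along these lines (process ids will differ):

$ jps
21321 NameNode
21412 DataNode
21503 SecondaryNameNode
21630 ResourceManager
21721 NodeManager
21800 Jps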

Work around an annoying Kerberos realm problem on OSX

The hadoop setup will at this point likely still complain with a message like Unable to load realm info from SCDynamicStore, which is caused by a java bug on OSX (more details here).

There are different ways to work around this, depending on whether you just want to get a local hadoop installation going or need your hadoop client to (also) access a remote kerberized hadoop cluster.

To get java running on the local (non-kerberized) setup, it is sufficient to just add some definitions to $HADOOP_OPTS (and $YARN_OPTS for yarn) in .[z]profile as described in this post.

The actual hostname probably does not matter too much, as you won’t do an actual kerberos exchange locally, but just get past the flawed “do we know a default realm” check in java.
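A minimal sketch of that workaround, assuming the properties are set in .[z]profile; the realm and kdc values below are placeholders and merely need to be present to satisfy the java check:

export HADOOP_OPTS="-Djava.security.krb5.realm=EXAMPLE.COM -Djava.security.krb5.kdc=kdc.example.com"
export YARN_OPTS="-Djava.security.krb5.realm=EXAMPLE.COM -Djava.security.krb5.kdc=kdc.example.com"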

In case you are planning to access a kerberized hadoop cluster please continue reading the next post.

Cleaning up

Some of the default logging settings make hadoop rather chatty on the console about deprecated configuration keys and other things. On OSX a few of these messages become a nuisance after a while, as they make it harder to spot real problems. You may want to adjust the log4j settings to mute warnings that you don’t want to see every single time you enter a hadoop command. In $HADOOP_HOME/etc/hadoop/log4j.properties you could add:

# Logging Threshold
log4j.threshold=ALL
# the native libs don't exist for OSX
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
# yes, we'll keep in mind that some things are deprecated
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=ERROR