Why would I do this?

An OSX laptop will not let you do any large-scale data processing, but it can be a convenient place to develop and debug hadoop scripts before running them on a real cluster. For this you will likely want a local hadoop “cluster” to play with, plus the ability to use the local commands as a client to a larger remote hadoop cluster. This post covers the local install and basic testing. A second post shows how to extend the setup for accessing/processing against a remote kerberized cluster.

Getting prepared

If you don’t yet have java (Yosemite does not actually come with it), the first step is to download the installer from the Oracle download site. Once it is installed, you should see something like this in a terminal shell:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

If you need to have several java versions installed and want to be able to switch between them, have a look at the nice description here.
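Even without extra tooling, the java_home helper that ships with OSX can already list and select installed JDKs. A quick sketch:

## list all installed JDKs
$ /usr/libexec/java_home -V
## point JAVA_HOME at a specific major version
$ export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)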

If you don’t yet have the homebrew package manager installed, get it now by following the (one line) installation instructions on http://brew.sh. Homebrew packages live in /usr/local and rarely interfere with other stuff on your machine (unless you ask them to). Install the hadoop package as a normal user using:

$ brew install hadoop

(At the time of writing I got hadoop 2.5.1)
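You can double-check what actually got installed:

$ hadoop version
Hadoop 2.5.1
...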

BTW: once you start using brew for other packages as well, be careful with brew upgrade. E.g. you may want to use brew pin to avoid getting a new hadoop version installed as a side effect of other package upgrades.
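For example, to keep the hadoop formula at its current version while still upgrading everything else:

## exclude hadoop from future 'brew upgrade' runs
$ brew pin hadoop
## check which formulae are currently pinned
$ brew list --pinned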

Configure

Next step: edit a few config files. In .[z]profile you may want to add a few shortcuts to quickly jump to the relevant places, or to be able to switch between hadoop and java versions; this is not strictly required to run hadoop, though.

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_VERSION=2.5.1
export HADOOP_BASE=/usr/local/Cellar/hadoop/${HADOOP_VERSION}
export HADOOP_HOME=$HADOOP_BASE/libexec
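After sourcing the profile, verify that the variables resolve to something sensible (the exact JDK path will depend on the version you installed):

$ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home
$ ls $HADOOP_HOME/etc/hadoop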

Now you should edit a few hadoop files in your hadoop configuration directory:

cd $HADOOP_HOME/etc/hadoop

In core-site.xml expand the configuration to:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
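The hadoop.tmp.dir set above is where hdfs will keep its name and data directories; you may want to create it up front:

$ mkdir -p /usr/local/Cellar/hadoop/hdfs/tmp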

In hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

and finally in mapred-site.xml (if only mapred-site.xml.template exists in the configuration directory, copy it to mapred-site.xml first):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>

Now it’s time to:

  • Initialise hdfs
$ hdfs namenode -format
  • Start hdfs and yarn
## start the hadoop daemons (move to a launchd plist to do this automatically)
$ $HADOOP_BASE/sbin/start-dfs.sh
$ $HADOOP_BASE/sbin/start-yarn.sh
  • Test your hdfs setup (see also the smoke test below)
## create your hdfs home directory (a freshly formatted hdfs does not have one yet)
$ hdfs dfs -mkdir -p /user/$(whoami)
## show the (still empty) homedir in hdfs
$ hdfs dfs -ls
## put some local file
$ hdfs dfs -put myfile.txt
## now we should see the new file
$ hdfs dfs -ls
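Once the daemons are up you can also check them with jps and the web UIs, and run one of the bundled example jobs as an end-to-end smoke test (the jar version has to match your install):

## expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
$ jps
## web UIs: http://localhost:50070 (hdfs) and http://localhost:8088 (yarn)
## estimate pi (badly, but on yarn) with 2 maps of 5 samples each
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-${HADOOP_VERSION}.jar pi 2 5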

Work around an annoying Kerberos realm problem on OSX

At this point the hadoop setup will likely still complain with a message like Unable to load realm info from SCDynamicStore, which is caused by a java bug on OSX (more details here).

There are different ways to work around this, depending on whether you just want to get a local hadoop installation going or need your hadoop client to (also) access a remote kerberized hadoop cluster.

To get java running in the local (non-kerberized) setup, it is sufficient to add some definitions to $HADOOP_OPTS (and $YARN_OPTS for yarn) in .[z]profile, as described in this post.

The actual hostname probably does not matter too much: you won’t do an actual kerberos exchange locally, you just need to get past the flawed “do we know a default realm” check in java.
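A minimal sketch of those definitions, with placeholder realm and KDC values (anything that satisfies the check should do):

## placeholder realm/KDC -- no real kerberos exchange happens locally
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm=EXAMPLE.COM -Djava.security.krb5.kdc=localhost"
export YARN_OPTS="${YARN_OPTS} -Djava.security.krb5.realm=EXAMPLE.COM -Djava.security.krb5.kdc=localhost"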

In case you are planning to access a kerberized hadoop cluster please continue reading the next post.

Cleaning up

Some of the default logging settings make hadoop rather chatty on the console, warning about deprecated configuration keys and other things. On OSX a few of these messages turn into nagging after a while, as they make it harder to spot real problems. You may want to adjust the log4j settings to mute warnings that you don’t want to see every single time you run a hadoop command. In $HADOOP_HOME/etc/hadoop/log4j.properties you could add:

# Logging Threshold
log4j.threshold=ALL
# the native libs don't exist for OSX
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
# yes, we'll keep in mind that some things are deprecated
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=ERROR