Why would I do this?
An OSX laptop will not let you do any large-scale data processing, but it can be a convenient place to develop and debug hadoop scripts before running them on a real cluster. For this you likely want a local hadoop “cluster” to play with, and to use the local commands as a client for a larger remote hadoop cluster. This post covers the local install and basic testing. A second post shows how to extend the setup for accessing/processing against a remote kerberized cluster.
Getting prepared
If you don’t yet have java (yosemite does not actually come with it), the first step is to download the installer from the Oracle download site. Once it is installed, a terminal shell should show something like:
$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
If you need several java versions installed and want to be able to switch between them, have a look at the nice description here.
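The /usr/libexec/java_home helper (used again in the configure section below) can also pick a specific version, which makes switching straightforward:
## list all installed JDKs
$ /usr/libexec/java_home -V
## print the home directory of a particular version
$ /usr/libexec/java_home -v 1.7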
If you don’t yet have the homebrew package manager installed, get it now by following the (one-line) installation instructions on http://brew.sh. Homebrew packages live in /usr/local and rarely interfere with other stuff on your machine (unless you ask them to). Install the hadoop package as a normal user using:
$ brew install hadoop
(At the time of writing I got hadoop 2.5.1)
BTW: once you start using brew for other packages as well, be careful with brew upgrade. E.g. you may want to use brew pin to avoid getting a new hadoop version installed as a side effect of other package upgrades.
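For example, pinning keeps the hadoop formula at its current version while everything else gets upgraded:
$ brew pin hadoop
$ brew upgrade
## later, to allow hadoop upgrades again
$ brew unpin hadoop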
Configure
Next stop: editing a few config files. In .[z]profile you may want to add a few shortcuts to quickly jump to the relevant places, or to be able to switch between hadoop and java versions; this is not strictly required to run hadoop, though.
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_VERSION=2.5.1
export HADOOP_BASE=/usr/local/Cellar/hadoop/${HADOOP_VERSION}
export HADOOP_HOME=$HADOOP_BASE/libexec
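As an illustration of such a shortcut (the alias name is just my own suggestion), you could also add:
## jump straight to the hadoop configuration directory
alias hconf='cd $HADOOP_HOME/etc/hadoop'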
Now you should edit a few hadoop files in your hadoop configuration directory:
cd $HADOOP_HOME/etc/hadoop
In core-site.xml expand the configuration to:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/Cellar/hadoop/hdfs/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
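hadoop should create hadoop.tmp.dir on demand, but creating it up front does no harm and avoids surprises with permissions:
$ mkdir -p /usr/local/Cellar/hadoop/hdfs/tmp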
In hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
and finally in mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9010</value>
  </property>
</configuration>
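At this point you can sanity-check that hadoop actually picks up the edited files; hdfs getconf prints the effective value of a configuration key (possibly preceded by a deprecation warning, more on those below):
$ hdfs getconf -confKey fs.default.name
hdfs://localhost:9000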
Now it’s time to:
- Initialise hdfs
$ hadoop namenode -format
- Start hdfs and yarn
## start the hadoop daemons (move to launchd plist to do this automatically)
$ $HADOOP_BASE/sbin/start-dfs.sh
$ $HADOOP_BASE/sbin/start-yarn.sh
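## optional sanity check: jps (part of the JDK) should now list the daemons,
## i.e. NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
$ jps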
- Test your hdfs setup
## create your hdfs home directory first (a freshly formatted hdfs has none yet)
$ hdfs dfs -mkdir -p /user/$(whoami)
## show (still empty) homedir in hdfs
$ hdfs dfs -ls
## put some local file
$ hdfs dfs -put myfile.txt
## now we should see the new file
$ hdfs dfs -ls
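To verify that yarn can actually run jobs (and not just that hdfs works), you can submit one of the bundled example jobs; the jar path below assumes the brew layout and hadoop version from above:
## estimate pi with a small sample MR job on the local cluster
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-${HADOOP_VERSION}.jar pi 2 5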
Work around an annoying Kerberos realm problem on OSX
The hadoop setup will at this point likely still complain with a message like Unable to load realm info from SCDynamicStore, which is caused by a java bug on OSX (more details here).
There are different ways to work around this, depending on whether you just want to get a local hadoop installation going or need your hadoop client to (also) access a remote kerberized hadoop cluster.
To get java running on the local (non-kerberized) setup, it is sufficient to just add some definitions to $HADOOP_OPTS (and $YARN_OPTS for yarn) in .[z]profile as described in this post.
The actual hostname probably does not matter too much, as you won’t do an actual kerberos exchange locally, but just get past the flawed “do we know a default realm” check in java.
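A commonly cited variant of this workaround looks like the following (the realm and kdc values are simply left empty here, since no real kerberos exchange happens locally):
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
export YARN_OPTS="${YARN_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="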
In case you are planning to access a kerberized hadoop cluster please continue reading the next post.
Cleaning up
Some of the default logging settings make hadoop rather chatty on the console about deprecated configuration keys and other things. On OSX a few of these messages become nagging after a while, as they make it harder to spot real problems. You may want to adjust the log4j settings to mute warnings that you don’t want to see every single time you enter a hadoop command. In $HADOOP_HOME/etc/hadoop/log4j.properties you could add:
# Logging Threshold
log4j.threshold=ALL
# the native libs don't exist for OSX
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
# yes, we'll keep in mind that some things are deprecated
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=ERROR