<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Big-Data on analytx.info</title>
    <link>https://analytx.info/tags/big-data/</link>
    <description>Recent content in Big-Data on analytx.info</description>
    <image>
      <title>analytx.info</title>
      <url>https://analytx.info/images/hubble.jpg</url>
      <link>https://analytx.info/images/hubble.jpg</link>
    </image>
    <generator>Hugo -- 0.157.0</generator>
    <language>en-us</language>
    <lastBuildDate>Tue, 18 Nov 2014 21:49:00 +0100</lastBuildDate>
    <atom:link href="https://analytx.info/tags/big-data/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Basic set-up of hadoop on OSX yosemite</title>
      <link>https://analytx.info/2014/11/18/basic-set-up-of-hadoop-on-osx-yosemite/</link>
      <pubDate>Tue, 18 Nov 2014 21:49:00 +0100</pubDate>
      <guid>https://analytx.info/2014/11/18/basic-set-up-of-hadoop-on-osx-yosemite/</guid>
      <description>Complete guide to installing and configuring Apache Hadoop on OS X Yosemite using Homebrew. Includes HDFS setup, troubleshooting Kerberos issues, and preparing for remote cluster access.</description>
      <content:encoded><![CDATA[<h2 id="why-would-i-do-this">Why would I do this?</h2>
<p>An OSX laptop will not let you do any larger-scale data processing, but it can
be a convenient place to develop and debug hadoop scripts before running them on a real
cluster. For this you likely want a local hadoop “cluster” to play
with, and to use the local commands as a client for a larger remote hadoop
cluster. This post covers the local install and basic testing. A second post
shows how to extend the setup for accessing and processing data on a remote
kerberized cluster.</p>
<h2 id="getting-prepared">Getting prepared</h2>
<p>If you don’t yet have java (yosemite does not come with it), the
first step is to download the installer from the Oracle <a href="https://www.java.com/en/download/index.jsp">download site</a>. Once
it is installed, you should see something like this in a terminal shell:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ java -version
</span></span><span style="display:flex;"><span>java version <span style="color:#e6db74">&#34;1.7.0_45&#34;</span>
</span></span><span style="display:flex;"><span>Java<span style="color:#f92672">(</span>TM<span style="color:#f92672">)</span> SE Runtime Environment <span style="color:#f92672">(</span>build 1.7.0_45-b18<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>Java HotSpot<span style="color:#f92672">(</span>TM<span style="color:#f92672">)</span> 64-Bit Server VM <span style="color:#f92672">(</span>build 24.45-b08, mixed mode<span style="color:#f92672">)</span>
</span></span></code></pre></div><p>If you need to have several java versions installed and want to be able to
switch between them, have a look at the nice description <a href="http://java.dzone.com/articles/multiple-versions-java-os-x">here</a>.</p>
<p>If you don’t yet have the homebrew package manager installed then get it now
by following the (one line) installation on <a href="http://brew.sh">http://brew.sh</a>. Homebrew packages
live in <code>/usr/local</code> and rarely interfere with other stuff on your machine
(unless you ask them to). Install the hadoop package as a normal user using:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ brew install hadoop
</span></span></code></pre></div><p>(At the time of writing I got hadoop 2.5.1)</p>
<p>BTW: Once you start using brew for other packages as well, be careful when
using <code>brew upgrade</code>. For example, you may want to use <code>brew pin</code> to avoid getting a new
hadoop version installed while upgrading other packages.</p>
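<p>A minimal sketch of pinning, assuming the formula is called <code>hadoop</code> as in the install command above and that <code>brew</code> is on your path. The <code>formula</code> variable is just an illustrative name:</p>

```shell
# Illustrative variable; "hadoop" is the formula installed above.
formula="hadoop"

if command -v brew >/dev/null; then
    # Hold the currently installed version of the formula.
    brew pin "$formula"
    # Show which formulae are currently pinned.
    brew list --pinned
    # To release the pin later:
    #   brew unpin "$formula"
else
    echo "brew not found; nothing to pin"
fi
```

<p>Once pinned, <code>brew upgrade</code> should leave hadoop alone until you unpin it.</p>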
<h2 id="configure">Configure</h2>
<p>Next stop: edit a few config files. In <code>.[z]profile</code> you may want to add a few
shortcuts to quickly jump to the relevant places or to switch
between hadoop and java versions, but this is not strictly required to run hadoop.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>export JAVA_HOME<span style="color:#f92672">=</span><span style="color:#66d9ef">$(</span>/usr/libexec/java_home<span style="color:#66d9ef">)</span>
</span></span><span style="display:flex;"><span>export HADOOP_VERSION<span style="color:#f92672">=</span>2.5.1
</span></span><span style="display:flex;"><span>export HADOOP_BASE<span style="color:#f92672">=</span>/usr/local/Cellar/hadoop/<span style="color:#e6db74">${</span>HADOOP_VERSION<span style="color:#e6db74">}</span>
</span></span><span style="display:flex;"><span>export HADOOP_HOME<span style="color:#f92672">=</span>$HADOOP_BASE/libexec
</span></span></code></pre></div><p>Now you should edit a few hadoop files in your hadoop configuration directory:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>cd $HADOOP_HOME/etc/hadoop
</span></span></code></pre></div><p>In <code>core-site.xml</code> expand the configuration to:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;configuration&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>hadoop.tmp.dir<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>/usr/local/Cellar/hadoop/hdfs/tmp<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;description&gt;</span>A base for other temporary directories.<span style="color:#f92672">&lt;/description&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>fs.default.name<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>hdfs://localhost:9000<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/configuration&gt;</span>
</span></span></code></pre></div><p>In <code>hdfs-site.xml</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;configuration&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>dfs.replication<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>1<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/configuration&gt;</span>
</span></span></code></pre></div><p>and finally in <code>mapred-site.xml</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;configuration&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>mapred.job.tracker<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>localhost:9010<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/configuration&gt;</span>
</span></span></code></pre></div><p>Now it’s time to:</p>
<ul>
<li>Initialise hdfs</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ hadoop namenode -format
</span></span></code></pre></div><ul>
<li>Start hdfs and yarn</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e">## start the hadoop daemons (move to launchd plist to do this automatically)</span>
</span></span><span style="display:flex;"><span>$ $HADOOP_BASE/sbin/start-dfs.sh
</span></span><span style="display:flex;"><span>$ $HADOOP_BASE/sbin/start-yarn.sh
</span></span></code></pre></div><ul>
<li>Test your hdfs setup</li>
</ul>
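<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e">## create your hdfs homedir (a freshly formatted hdfs does not have one yet)</span>
</span></span><span style="display:flex;"><span>$ hdfs dfs -mkdir -p /user/<span style="color:#66d9ef">$(</span>whoami<span style="color:#66d9ef">)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## show (still empty) homedir in hdfs</span>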
</span></span><span style="display:flex;"><span>$ hdfs dfs -ls
</span></span><span style="display:flex;"><span><span style="color:#75715e">## put some local file</span>
</span></span><span style="display:flex;"><span>$ hdfs dfs -put myfile.txt
</span></span><span style="display:flex;"><span><span style="color:#75715e">## now we should see the new file</span>
</span></span><span style="display:flex;"><span>$ hdfs dfs -ls
</span></span></code></pre></div><h2 id="work-around-an-annoying-kerberos-realm-problem-on-osx">Work around an annoying Kerberos realm problem on OSX</h2>
<p>The hadoop setup will at this point likely still complain with a message
like <code>Unable to load realm info from SCDynamicStore</code>, which is caused by a java
bug on OSX (more <a href="http://mail.openjdk.java.net/pipermail/macosx-port-dev/2013-March/005443.html">details here</a>).</p>
<p>There are different ways to work around this, depending on whether you just
want to get a local hadoop installation going or need your hadoop client to
(also) access a remote kerberized hadoop cluster.</p>
<p>To get java running on the local (non-kerberized) setup, it is
sufficient to add the java system properties for a default realm and KDC
(<code>java.security.krb5.realm</code> and <code>java.security.krb5.kdc</code>) to <code>$HADOOP_OPTS</code> (and <code>$YARN_OPTS</code> for
yarn) in <code>.[z]profile</code> as described in <a href="http://stackoverflow.com/questions/7134723/hadoop-on-osx-unable-to-load-realm-info-from-scdynamicstore">this post</a>.</p>
<p>The actual hostname probably does not matter too much, as you won’t do an
actual kerberos exchange locally, but just get past the flawed
“do we know a default realm” check in java.</p>
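<p>The workaround described above could look roughly like this in <code>.[z]profile</code>. This is a sketch only: the realm <code>EXAMPLE.COM</code> and the host <code>kdc.example.com</code> are placeholders, and the <code>KRB5_*</code> helper variables are names made up for this example:</p>

```shell
# Placeholder values (assumptions): any plausible realm and KDC host
# should do for a local, non-kerberized setup.
KRB5_REALM="EXAMPLE.COM"
KRB5_KDC="kdc.example.com"

# JVM system properties that give java a default realm and KDC,
# so it does not try to look them up in SCDynamicStore.
KRB5_FLAGS="-Djava.security.krb5.realm=${KRB5_REALM} -Djava.security.krb5.kdc=${KRB5_KDC}"

# Append to any options that are already set instead of overwriting them.
export HADOOP_OPTS="${HADOOP_OPTS:-} ${KRB5_FLAGS}"
export YARN_OPTS="${YARN_OPTS:-} ${KRB5_FLAGS}"
```

<p>Open a new shell (or source the profile again) before re-running the hadoop commands, so that the new options are picked up.</p>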
<p>If you are planning to access a kerberized hadoop cluster,
please continue with the next post.</p>
<h2 id="cleaning-up">Cleaning up</h2>
<p>Some of the default logging settings make hadoop rather chatty on the console
about deprecated configuration keys and other things. On OSX a few of these
messages become annoying after a while, as they make it harder to spot real
problems. You may want to adjust the <code>log4j</code> settings to mute warnings that you
don’t want to see every single time you enter a hadoop command. In
<code>$HADOOP_HOME/etc/hadoop/log4j.properties</code> you could add:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-properties" data-lang="properties"><span style="display:flex;"><span><span style="color:#75715e"># Logging Threshold</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">log4j.threshold</span><span style="color:#f92672">=</span><span style="color:#e6db74">ALL</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># the native libs don&#39;t exist for OSX</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">log4j.logger.org.apache.hadoop.util.NativeCodeLoader</span><span style="color:#f92672">=</span><span style="color:#e6db74">ERROR</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># yes, we&#39;ll keep in mind that some things are deprecated</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">log4j.logger.org.apache.hadoop.conf.Configuration.deprecation</span><span style="color:#f92672">=</span><span style="color:#e6db74">ERROR</span>
</span></span></code></pre></div>]]></content:encoded>
    </item>
  </channel>
</rss>
