<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Posts on analytx.info</title>
    <link>https://analytx.info/post/</link>
    <description>Recent content in Posts on analytx.info</description>
    <image>
      <title>analytx.info</title>
      <url>https://analytx.info/images/hubble.jpg</url>
      <link>https://analytx.info/images/hubble.jpg</link>
    </image>
    <generator>Hugo -- 0.157.0</generator>
    <language>en-us</language>
    <lastBuildDate>Mon, 15 Sep 2025 00:00:00 +0200</lastBuildDate>
    <atom:link href="https://analytx.info/post/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>A small step for this blog, but a big help for me…</title>
      <link>https://analytx.info/2025/09/15/a-small-step-for-this-blog-but-a-big-help-for-me/</link>
      <pubDate>Mon, 15 Sep 2025 00:00:00 +0200</pubDate>
      <guid>https://analytx.info/2025/09/15/a-small-step-for-this-blog-but-a-big-help-for-me/</guid>
      <description>Using Claude Code to revive a Hugo blog after years of inactivity — upgrading themes, fixing incompatibilities, and changing the workflow without leaving Emacs.</description>
      <content:encoded><![CDATA[<p>After quite some time of inactivity (e.g. because of the day job) I revived this almost abandoned blog as a test case for playing with 2025 AI coding tools.</p>
<p>My problem:</p>
<p>As with many blogs, the gaps between new articles became so long that I often had to update the already minimal blog implementation first.</p>
<p>I realized that it took more time to upgrade Hugo and adapt to the changes than I spent on writing actual content.</p>
<p>Hence <a href="https://www.anthropic.com/claude-code">claude-code</a> was a rather welcome and capable help in dealing with these boilerplate changes:</p>
<ul>
<li>
<p><strong>simple:</strong> I no longer use Twitter / X and have moved to Mastodon</p>
<p><em>claude$</em> can you change social coordinates in my blog?</p>
</li>
</ul>
<p>This is a simple config change, but typing the request into a simple terminal line is still a very different way of coding from only a few years ago (e.g. Google/Stack Overflow, edit, debug, deploy).</p>
<p>Together with the nice Emacs package <a href="https://github.com/manzaltu/claude-code-ide.el">claude-code-ide.el</a>, this can all take place without a single open source-code buffer.
As with human coding partners, a small but relevant set of questions from the tool about the task goals and alternative implementation options helps to build confidence.</p>
<p>Even more convincing (even if not strictly required) is watching the log of actions and applied changes, detected problems and found fixes.
This exposes the intricate beauty of watching someone else work - and surely inspires the rather subjective experience of increased efficiency that many report. It is probably similar to the fulfilling experience of watching an experienced Japanese swordmaker on YouTube doing work you would neither be able (nor usually be required) to do. Hopefully the loss of efficiency by people watching reels will be compensated by the gain in efficiency by coders - but I might be digressing&hellip;</p>
<p>Next, and slightly more challenging (at least for me):</p>
<ul>
<li>
<p><strong>not so simple:</strong> the configuration of the blog theme, color scheme, the embedded source code boxes and the feature images had evolved over almost half a decade.</p>
<p><em>claude$</em> can you upgrade hugo and theme and fix those backward incompatible changes?</p>
</li>
<li>
<p><strong>unstructured:</strong> it has been quite some time since I moved from WordPress to Hugo.</p>
<p><em>claude$</em> maybe there is a better hugo theme or workflow from emacs org-mode to implement a simple blog today in 2025?</p>
</li>
</ul>
<p>Claude Code (e.g. as claude-code-ide in Emacs) was quite useful for achieving all of the above goals without forcing me to learn too much about syntax changes on the Hugo side. It also successfully kept me from migrating to yet another technology, and instead suggested small changes to my org workflow (small modifications to the org template for new posts).</p>
]]></content:encoded>
    </item>
    <item>
      <title>Hugo and org: experience some time after migration</title>
      <link>https://analytx.info/2025/09/08/hugo-and-org-experience-some-time-after-migration/</link>
      <pubDate>Mon, 08 Sep 2025 10:00:00 +0200</pubDate>
      <guid>https://analytx.info/2025/09/08/hugo-and-org-experience-some-time-after-migration/</guid>
      <description>Notes on migrating a WordPress/Drupal blog to Hugo with org-mode as the authoring format, using ox-hugo and Doom Emacs.</description>
      <content:encoded><![CDATA[<p>I have now converted all my WordPress- and Drupal-based blog sites to Hugo with Org format as input. Doom Emacs has simplified this process, as the standard config is already convenient to use and requires only minor tweaks (e.g. in case you would like to keep a few unpublished blog entries in a common org file).</p>
]]></content:encoded>
    </item>
    <item>
      <title>Monty Hall problem - a small simulation in R</title>
      <link>https://analytx.info/2017/11/12/monty-hall-problem-a-small-simulation-in-r/</link>
      <pubDate>Sun, 12 Nov 2017 12:32:00 +0100</pubDate>
      <guid>https://analytx.info/2017/11/12/monty-hall-problem-a-small-simulation-in-r/</guid>
      <description>A Monte Carlo simulation of the Monty Hall problem in R, showing how a few lines of code settle a famous probability puzzle that trips up intuition.</description>
      <content:encoded><![CDATA[<p>The Monty Hall problem is an interesting example of how much intuition
can mislead us in some statistical contexts. Even more disturbing, though, is
how long we are prepared to debate and defend an expected result before
actually checking our initial guesses with a simple Monte Carlo simulation.</p>
<p>Here is a simple simulation of the Monty Hall game show
problem:</p>
<p>In the TV show <a href="https://en.wikipedia.org/wiki/Let%27s%5FMake%5Fa%5FDeal">&ldquo;Let&rsquo;s Make a Deal&rdquo;</a> the host Monty Hall would offer the
participant the choice of three doors. One of them was hiding a valuable prize (e.g. a
car) - behind the other two doors were only less desirable goats.</p>
<p>After the participant had chosen a door, which was still kept closed, Monty would
open one of the other doors, showing that it did not hide the prize. After
that door was opened, the participant was asked whether they wanted to keep their initial
choice or rather change their mind and swap
to the other of the two still closed doors.</p>
<p>It turns out that swapping the initial choice of door is in fact the better
strategy in this game, which many consider counter-intuitive. The
public discussion about this problem in a print magazine revealed a lot about
the excitement and sometimes rather unscientific culture when debating a
&ldquo;scientific&rdquo; question via public media. And Twitter was not even available.</p>
<figure>
    <img loading="lazy" src="goat.jpg"/> 
</figure>

<p>Here is a small Monte Carlo toy in R to compare the
two game strategies. It mainly uses two features:</p>
<ul>
<li>the <code>sample()</code> function to pick randomly &ndash;  here from a vector of doors.</li>
<li>vectors indexed with negative indices, which evaluate to all but the &ldquo;subtracted&rdquo;
elements</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span>doors     <span style="color:#f92672">&lt;-</span> <span style="color:#ae81ff">1</span><span style="color:#f92672">:</span><span style="color:#ae81ff">3</span>  <span style="color:#75715e"># 3 doors named 1 to 3</span>
</span></span><span style="display:flex;"><span>N         <span style="color:#f92672">&lt;-</span> <span style="color:#ae81ff">1000</span> <span style="color:#75715e"># we play N times</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>swap_wins <span style="color:#f92672">&lt;-</span> <span style="color:#ae81ff">0</span>    <span style="color:#75715e"># set win counter to 0</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> (i <span style="color:#66d9ef">in</span> <span style="color:#ae81ff">1</span><span style="color:#f92672">:</span>N) {
</span></span><span style="display:flex;"><span>  prize  <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">sample</span>(doors,<span style="color:#ae81ff">1</span>)                   <span style="color:#75715e"># randomly pick a door for the prize</span>
</span></span><span style="display:flex;"><span>  player <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">sample</span>(doors,<span style="color:#ae81ff">1</span>)                   <span style="color:#75715e"># player randomly picks a door</span>
</span></span><span style="display:flex;"><span>  rest   <span style="color:#f92672">&lt;-</span> doors[<span style="color:#f92672">-</span><span style="color:#a6e22e">c</span>(player,prize)]           <span style="color:#75715e"># remaining doors: player and prize doors removed</span>
</span></span><span style="display:flex;"><span>  monty  <span style="color:#f92672">&lt;-</span> rest[<span style="color:#a6e22e">sample.int</span>(<span style="color:#a6e22e">length</span>(rest),<span style="color:#ae81ff">1</span>)]  <span style="color:#75715e"># monty picks one by index: sample(n,1) would draw from 1:n</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  swap_wins <span style="color:#f92672">&lt;-</span> (prize <span style="color:#f92672">!=</span> player) <span style="color:#f92672">+</span> swap_wins  <span style="color:#75715e"># did swap strategy win this time?</span>
</span></span><span style="display:flex;"><span>}                                             <span style="color:#75715e"># (prize was not behind initial choice)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">message</span>(<span style="color:#e6db74">&#34;fraction of swap wins: &#34;</span>, <span style="color:#a6e22e">round</span>(swap_wins<span style="color:#f92672">/</span>N,<span style="color:#ae81ff">3</span>))
</span></span></code></pre></div><p>The result of around 0.66 confirms that swapping the door after
Monty has opened another one is indeed a better strategy than sticking to the
initial choice, which has only a 1/3 probability of success.</p>
<p>Such a simulation is of course neither the only nor the most elegant way
to arrive at this result. But in case of doubt it definitely looks
easier to run a quick test than to embark on a public discussion. At
the time this involved the exchange of 10,000 reader letters in the US and
proportional excitement in a similar thread in Germany. For more details check
out the Wikipedia article
<a href="https://en.wikipedia.org/wiki/Monty%5FHall%5Fproblem">here</a> or a small book about this topic (in German) <a href="https://www.rowohlt.de/buch/gero-von-randow-das-ziegenproblem-9783644437111">here</a>.</p>
<p>PS: Did you notice that the simulation program does not actually use the line that determines
Monty&rsquo;s pick? A hint at an analytical solution.</p>
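<p>The hint can also be checked directly. The following is only a sketch and not part of the original program: the helper names <code>pick_one</code> and <code>swap_wins_game</code> are made up for illustration, and it assumes that prize door and first pick are independent and uniformly distributed. It plays the swap strategy literally, so that Monty&rsquo;s door is really used, for all nine combinations of prize door and first pick, and compares the outcome with the shortcut <code>prize != player</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R">doors &lt;- 1:3

# hypothetical helper: pick one element by index (also safe for length-one vectors)
pick_one &lt;- function(x) x[sample.int(length(x), 1)]

# hypothetical helper: play the swap strategy for a given prize door and first pick
swap_wins_game &lt;- function(prize, player) {
  monty &lt;- pick_one(doors[-c(player, prize)])  # Monty opens a door that is neither picked nor the prize
  swap  &lt;- doors[-c(player, monty)]            # the only door left to swap to
  swap == prize
}

cases &lt;- expand.grid(prize = doors, player = doors)  # all 9 equally likely starting situations
cases$swap_wins &lt;- mapply(swap_wins_game, cases$prize, cases$player)

all(cases$swap_wins == (cases$prize != cases$player))  # does Monty's pick ever change the outcome?
mean(cases$swap_wins)                                  # exact fraction of swap wins
</code></pre></div>
<p>The comparison holds in all nine cases: whichever door Monty opens, swapping wins exactly when the first pick was wrong. That is the case in 6 of the 9 situations, so the exact probability is 2/3.</p>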
]]></content:encoded>
    </item>
    <item>
      <title>Slides and blog posts with R and emacs org-mode</title>
      <link>https://analytx.info/2017/06/04/slides-and-blog-posts-with-r-and-emacs-org-mode/</link>
      <pubDate>Sun, 04 Jun 2017 12:59:00 +0200</pubDate>
      <guid>https://analytx.info/2017/06/04/slides-and-blog-posts-with-r-and-emacs-org-mode/</guid>
      <description>Authoring R-based presentations and blog posts from a single Emacs org-mode file — keeping code, plots, and prose in sync automatically.</description>
      <content:encoded><![CDATA[<p>Preparing a larger number of slides with <code>R</code> code and plots can be a bit
tedious with standard desktop presentation software like PowerPoint or Keynote.
The manual effort to change the example code, run the analysis and then cut and
paste updated graphs, tables and code is high. Sooner or later one is bound to
create inconsistencies between code and expected results or even syntax errors.</p>
<p>Using <a href="https://www.gnu.org/software/emacs/"><code>emacs</code></a> and its Swiss army knife <a href="http://orgmode.org"><code>org-mode</code></a> there is another elegant and
reproducible solution: just export the consistent code and output
from an org-mode code block to generate either slides or blog pages (or both)
from a single source. As a free benefit, this tool chain can also produce
high quality PDF handouts with a managed table of contents and an index.</p>
<h2 id="org-mode-source-blocks">Org Mode source blocks</h2>
<p>Here is a standard statistics &ldquo;hello world&rdquo; example using ggplot2:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span>  <span style="color:#a6e22e">library</span>(ggplot2)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  df <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.frame</span>( norm <span style="color:#f92672">=</span> <span style="color:#a6e22e">rnorm</span>(<span style="color:#ae81ff">10000</span>))
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ggplot</span>(df, <span style="color:#a6e22e">aes</span>(x <span style="color:#f92672">=</span> norm)) <span style="color:#f92672">+</span> <span style="color:#a6e22e">geom_histogram</span>()</span></span></code></pre></td></tr></table>
</div>
</div>
<figure>
    <img loading="lazy" src="images/plot.png"/> 
</figure>

<p>In org mode, the source for the above blog output looks like this:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;"><code class="language-org" data-lang="org"><span style="display:flex;"><span>  ... some text describing the example...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex; background-color:#3c3d38"><span><span style="color:#75715e">  #+BEGIN_SRC </span><span style="color:#75715e">R</span><span style="color:#75715e"> :exports both :results output graphics :file p.png
</span></span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">library</span>(ggplot2)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  df <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.frame</span>( norm <span style="color:#f92672">=</span> <span style="color:#a6e22e">rnorm</span>(<span style="color:#ae81ff">10000</span>))
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">ggplot</span>(df, <span style="color:#a6e22e">aes</span>(x <span style="color:#f92672">=</span> norm)) <span style="color:#f92672">+</span> <span style="color:#a6e22e">geom_histogram</span>()
</span></span><span style="display:flex; background-color:#3c3d38"><span><span style="color:#75715e">  #+END_SRC</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  ... more standard text...</span></span></code></pre></td></tr></table>
</div>
</div>
<p>Note the options in the code block header that define whether source and output (including
graphics) will be visible on the output page or slide. That&rsquo;s all it takes to
produce the post from the code and consistently re-create all output on the
next publish step.</p>
<h2 id="from-blog-entries-to-slides">From blog entries to slides</h2>
<p>To create visually appealing slides from org-mode there are several
alternatives. For my not too complicated workflow I opted for <a href="https://github.com/hakimel/reveal.js"><code>reveal.js</code></a>, the
JavaScript package by Hakim El Hattab. The installation and basic usage of this
package from org-mode is described e.g. in this very nice <a href="http://cestlaz.github.io/posts/using-emacs-11-reveal/#.WTQcpsZBsUE">tutorial</a> by Mike Zamansky.
With a few steps one can create a self-contained presentation
file, which does not require any major software installation on the
viewer&rsquo;s side apart from a web browser.</p>
<p>Going from the above org-mode blog source to a set of slides is only a small step:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;"><code class="language-org" data-lang="org"><span style="display:flex; background-color:#3c3d38"><span>  #+REVEAL_THEME: sky
</span></span><span style="display:flex; background-color:#3c3d38"><span>  #+OPTIONS: toc:nil num:nil reveal_title_slide:nil
</span></span><span style="display:flex; background-color:#3c3d38"><span>  <span style="color:#66d9ef">* </span>The Normal Distribution
</span></span><span style="display:flex;"><span><span style="color:#75715e">  #+BEGIN_SRC </span><span style="color:#75715e">R</span><span style="color:#75715e"> :exports both :results output graphics :file p.png :height 300
</span></span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">library</span>(ggplot2)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    df <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.frame</span>( norm <span style="color:#f92672">=</span> <span style="color:#a6e22e">rnorm</span>(<span style="color:#ae81ff">10000</span>))
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ggplot</span>(df, <span style="color:#a6e22e">aes</span>(x <span style="color:#f92672">=</span> norm)) <span style="color:#f92672">+</span> <span style="color:#a6e22e">geom_histogram</span>()
</span></span><span style="display:flex;"><span><span style="color:#75715e">  #+END_SRC</span>
</span></span><span style="display:flex; background-color:#3c3d38"><span>  <span style="color:#66d9ef">- </span>symmetric around $x = 0$
</span></span><span style="display:flex; background-color:#3c3d38"><span>  <span style="color:#66d9ef">- </span>$P(x) = \frac{1}{{\sqrt {2\pi } }}e^{ - \frac{{x^2 }}{2}}$</span></span></code></pre></td></tr></table>
</div>
</div>
<p>And here is the result:</p>
<figure>
    <img loading="lazy" src="/images/slide.png"/> 
</figure>

<p>Just a PNG for now, as the blog publishing step needs a fix to allow accessing <code>reveal.js</code>
from my blog hosting site.</p>]]></content:encoded>
    </item>
    <item>
      <title>Using Data Frames in Feather format (Apache Arrow)</title>
      <link>https://analytx.info/2016/04/07/using-data-frames-in-feather-format-apache-arrow/</link>
      <pubDate>Thu, 07 Apr 2016 18:43:00 +0200</pubDate>
      <guid>https://analytx.info/2016/04/07/using-data-frames-in-feather-format-apache-arrow/</guid>
      <description>Benchmarking the Feather/Apache Arrow format for fast R data frame I/O — comparing read and write performance on a 19 million row dataset.</description>
      <content:encoded><![CDATA[<p>Triggered by the RStudio blog article about <a href="http://blog.rstudio.org/2016/03/29/feather/">feather</a> I did the
one-line install and compared the results on a data frame
of 19 million rows. First results indeed look promising:</p>
<div class="highlight"><div style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;">
<table style="border-spacing:0;padding:0;margin:0;border:0;"><tr>
<td style="vertical-align:top;padding:0;margin:0;border:0;;width:100%">
<pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span><span style="color:#75715e"># build the package</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> devtools<span style="color:#f92672">::</span><span style="color:#a6e22e">install_github</span>(<span style="color:#e6db74">&#34;wesm/feather/R&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># load an existing data frame (19 million rows with batch job execution results)</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">load</span>(<span style="color:#e6db74">&#34;batch-12-2015.rda&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># write it in feather format...</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">write_feather</span>(dt,<span style="color:#e6db74">&#34;batch-12-2015.feather&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># ... which is not compressed, hence larger on disk</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system</span>(<span style="color:#e6db74">&#34;ls -lh batch-12-2015.*&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#f92672">-</span>rw<span style="color:#f92672">-</span>r<span style="color:#f92672">--</span>r<span style="color:#f92672">--</span> <span style="color:#ae81ff">1</span> dirkd staff <span style="color:#ae81ff">813</span>M <span style="color:#ae81ff">7</span> Apr <span style="color:#ae81ff">11</span><span style="color:#f92672">:</span><span style="color:#ae81ff">35</span> batch<span style="color:#ae81ff">-12-2015</span>.feather
</span></span><span style="display:flex;"><span><span style="color:#f92672">-</span>rw<span style="color:#f92672">-</span>r<span style="color:#f92672">--</span>r<span style="color:#f92672">--</span> <span style="color:#ae81ff">1</span> dirkd staff <span style="color:#ae81ff">248</span>M <span style="color:#ae81ff">27</span> Jan <span style="color:#ae81ff">22</span><span style="color:#f92672">:</span><span style="color:#ae81ff">42</span> batch<span style="color:#ae81ff">-12-2015</span>.rda
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># a few repeat reads on an older macbook with ssd</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system.time</span>(<span style="color:#a6e22e">load</span>(<span style="color:#e6db74">&#34;batch-12-2015.rda&#34;</span>))
</span></span><span style="display:flex;"><span>user system elapsed
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">8.984</span> <span style="color:#ae81ff">0.332</span> <span style="color:#ae81ff">9.331</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system.time</span>(dt1 <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">read_feather</span>(<span style="color:#e6db74">&#34;batch-12-2015.feather&#34;</span>))
</span></span><span style="display:flex;"><span>user system elapsed
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1.103</span> <span style="color:#ae81ff">1.094</span> <span style="color:#ae81ff">7.978</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system.time</span>(<span style="color:#a6e22e">load</span>(<span style="color:#e6db74">&#34;batch-12-2015.rda&#34;</span>))
</span></span><span style="display:flex;"><span>user system elapsed
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">9.045</span> <span style="color:#ae81ff">0.352</span> <span style="color:#ae81ff">9.418</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system.time</span>(dt1 <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">read_feather</span>(<span style="color:#e6db74">&#34;batch-12-2015.feather&#34;</span>))
</span></span><span style="display:flex;"><span>user system elapsed
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1.110</span> <span style="color:#ae81ff">0.658</span> <span style="color:#ae81ff">3.997</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system.time</span>(<span style="color:#a6e22e">load</span>(<span style="color:#e6db74">&#34;batch-12-2015.rda&#34;</span>))
</span></span><span style="display:flex;"><span>user system elapsed
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">9.009</span> <span style="color:#ae81ff">0.356</span> <span style="color:#ae81ff">9.393</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&gt;</span> <span style="color:#a6e22e">system.time</span>(dt1 <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">read_feather</span>(<span style="color:#e6db74">&#34;batch-12-2015.feather&#34;</span>))
</span></span><span style="display:flex;"><span>user system elapsed
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1.099</span> <span style="color:#ae81ff">0.711</span> <span style="color:#ae81ff">4.548</span></span></span></code></pre></td></tr></table>
</div>
</div>
<p>So: around half the elapsed time and roughly an eighth of the user CPU time for the uncompressed feather file!
Of course these repeat reads are served from the file system cache rather than the laptop SSD, but
the reduction in wall time is nice for larger volume loads.</p>
<p>More important, though, is the cross-language support for R, Python,
Scala/Spark and others, which could make feather the obvious exchange
format within a team or between workflow steps implemented in different
languages.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Setting up an RStudio server for iPad access</title>
      <link>https://analytx.info/2016/04/05/setting-up-an-rstudio-server-for-ipad-access/</link>
      <pubDate>Tue, 05 Apr 2016 00:00:00 +0200</pubDate>
      <guid>https://analytx.info/2016/04/05/setting-up-an-rstudio-server-for-ipad-access/</guid>
      <description>Running RStudio Server in Docker to access a full R environment from an iPad or any thin client browser, using the rocker/rstudio image.</description>
      <content:encoded><![CDATA[<p>Sometimes it can be convenient to run RStudio remotely from an iPad or another machine with little
RAM or disk space. This can be done quite easily using the free RStudio Server on OSX via docker. To
do this:</p>
<ul>
<li>
<p>Find the rocker/rstudio image on docker hub and follow the setup steps here at <a href="https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image">github</a>.</p>
</li>
<li>
<p>Once the image is running, you should be able to connect with Safari on the host Mac to the login
page eg at</p>
</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>open http://192.168.99.100:8787
</span></span></code></pre></div><p>Now only a small last step is needed: you need to expose the server port from the host on
the local network using the OSX firewall. In the somewhat explicit language of the “new” OSX
firewall this can be done using:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ echo <span style="color:#e6db74">&#34;rdr pass inet proto tcp from any to any port 8787 -&gt; 127.0.0.1 port 8787&#34;</span> | sudo pfctl -ef -
</span></span></code></pre></div><p>At this point you should be able to connect remotely from your iPad to</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>http://&lt;main-mac-ip-or-name&gt;:8787
</span></span></code></pre></div><p>and continue your R session where you left off, eg on your main machine.</p>
<p>BTW: If your network cannot be trusted then you should probably change the default login
credentials as described in the image docs.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Cached, asynchronous IP resolution</title>
      <link>https://analytx.info/2015/01/15/cached-asynchronous-ip-resolution/</link>
      <pubDate>Thu, 15 Jan 2015 19:39:00 +0100</pubDate>
      <guid>https://analytx.info/2015/01/15/cached-asynchronous-ip-resolution/</guid>
      <description>Efficient, non-blocking IP-to-hostname resolution in R using asynchronous DNS lookups and an in-memory cache to avoid repeated DNS queries.</description>
      <content:encoded><![CDATA[<p>Resolving IP addresses to host names is quite helpful for getting a quick overview of
who is connecting from where. This may need some care to not put too much
strain on your DNS server with a large number of repeated lookups. Also you
may not want to wait for timeouts on IPs that do not resolve. R itself does not
support this directly but can easily exploit asynchronous DNS lookup
tools like adns (on OSX from homebrew) and provide a cache to speed things
up. Here is a simple example for a vectorised lookup using a data.table as
persistent cache.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span><span style="color:#a6e22e">library</span>(data.table)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## this basic async lookup is a modified version of an idea described in</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## http://rud.is/b/2013/08/12/reverse-ip-address-lookups-with-r-from-simple-to-bulkasynchronous/</span>
</span></span><span style="display:flex;"><span>ip.to.host <span style="color:#f92672">&lt;-</span> <span style="color:#66d9ef">function</span>(ips) {
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## store ip list in a temp file</span>
</span></span><span style="display:flex;"><span>  tf <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">tempfile</span>()
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">cat</span>(ips, sep<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;\n&#39;</span>, file<span style="color:#f92672">=</span>tf)
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## use the adns filter to resolve them asynchronously (see man page for timeouts and other options)</span>
</span></span><span style="display:flex;"><span>  host.names <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">system</span>(<span style="color:#a6e22e">paste</span>(<span style="color:#e6db74">&#34;adnsresfilter &lt;&#34;</span>, tf) ,intern<span style="color:#f92672">=</span><span style="color:#66d9ef">TRUE</span>, ignore.stderr<span style="color:#f92672">=</span><span style="color:#66d9ef">TRUE</span>)
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## cleanup the temp file</span>
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">file.remove</span>(tf)
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">return</span>(host.names)
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## now extend the above to implement an ip to name cache</span>
</span></span><span style="display:flex;"><span>ip.cached.lookup <span style="color:#f92672">&lt;-</span> <span style="color:#66d9ef">function</span>(ips, reset.cache<span style="color:#f92672">=</span><span style="color:#66d9ef">FALSE</span>) {
</span></span><span style="display:flex;"><span>  cache.file <span style="color:#f92672">&lt;-</span> <span style="color:#e6db74">&#34;~/.ip.cache.rda&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## if the cache file exists and is readable: load it</span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">if</span> (<span style="color:#f92672">!</span>reset.cache <span style="color:#f92672">&amp;&amp;</span> <span style="color:#f92672">!</span><span style="color:#a6e22e">file.access</span>(cache.file,<span style="color:#ae81ff">4</span>)){
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">load</span>(cache.file)
</span></span><span style="display:flex;"><span>      <span style="color:#a6e22e">message</span>(<span style="color:#e6db74">&#34;ip cache entries loaded :&#34;</span>, <span style="color:#a6e22e">nrow</span>(host))
</span></span><span style="display:flex;"><span>  } <span style="color:#66d9ef">else</span> {
</span></span><span style="display:flex;"><span>      <span style="color:#75715e">## create an empty table (with just localhost)</span>
</span></span><span style="display:flex;"><span>      host <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.table</span>(hip<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;127.0.0.1&#34;</span>, hname<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;localhost&#34;</span>)
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## prepare a table of query ip and name</span>
</span></span><span style="display:flex;"><span>  qh <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.table</span>(hip<span style="color:#f92672">=</span><span style="color:#a6e22e">as.character</span>(ips),hname<span style="color:#f92672">=</span><span style="color:#66d9ef">NA</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## keep them sorted by ip to speedup data.table lookups</span>
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">setkey</span>(host,hip)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## resolve all known host names from the cache</span>
</span></span><span style="display:flex;"><span>  qh<span style="color:#f92672">$</span>hname <span style="color:#f92672">&lt;-</span> host[qh]<span style="color:#f92672">$</span>hname
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## collect the list of unique ips which did not get resolved yet</span>
</span></span><span style="display:flex;"><span>  new.ips <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">unique</span>(qh<span style="color:#a6e22e">[is.na</span>(qh<span style="color:#f92672">$</span>hname)]<span style="color:#f92672">$</span>hip)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e">## if not empty, resolve the rest</span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">if</span> (<span style="color:#a6e22e">length</span>(new.ips) <span style="color:#f92672">&gt;</span> <span style="color:#ae81ff">0</span>) {
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">## add the new ips to the cache table</span>
</span></span><span style="display:flex;"><span>    host <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">rbind</span>(host, <span style="color:#a6e22e">list</span>(hip<span style="color:#f92672">=</span>new.ips,hname<span style="color:#f92672">=</span><span style="color:#66d9ef">NA</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">## find locations which need resolution (either new or expired)</span>
</span></span><span style="display:flex;"><span>    need.resolving <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">is.na</span>(host<span style="color:#f92672">$</span>hname)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">message</span>(<span style="color:#e6db74">&#34;new ips to resolve: &#34;</span>, <span style="color:#a6e22e">sum</span>(need.resolving))
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">## and resolve them</span>
</span></span><span style="display:flex;"><span>    host<span style="color:#f92672">$</span>hname[need.resolving] <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">ip.to.host</span>(host[need.resolving]<span style="color:#f92672">$</span>hip)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">## need to set key again after rbind above..</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">setkey</span>(host,hip)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">## .. to do the remaining lookups</span>
</span></span><span style="display:flex;"><span>    qh<span style="color:#f92672">$</span>hname <span style="color:#f92672">&lt;-</span> host[qh]<span style="color:#f92672">$</span>hname
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e">## save the new cache status</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">save</span>(host, file <span style="color:#f92672">=</span> cache.file)
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#66d9ef">return</span>(qh<span style="color:#f92672">$</span>hname)
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## with this function you can easily add a host.name column to your</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## weblog data.table from the previous posts to get started with</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## the real log analysis</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>w<span style="color:#f92672">$</span>host.name <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">ip.cached.lookup</span>(w<span style="color:#f92672">$</span>host)
</span></span></code></pre></div>]]></content:encoded>
    </item>
    <item>
      <title>Using R for weblog analysis</title>
      <link>https://analytx.info/2015/01/15/using-r-for-weblog-analysis/</link>
      <pubDate>Thu, 15 Jan 2015 19:39:00 +0100</pubDate>
      <guid>https://analytx.info/2015/01/15/using-r-for-weblog-analysis/</guid>
      <description>Parsing and analysing Apache web server logs in R — extracting hits, geolocation, bandwidth usage, and client statistics from raw log files.</description>
      <content:encoded><![CDATA[<h2 id="apache-weblog-analysis">Apache Weblog Analysis</h2>
<p>Whether you run your own blog or web server or use some hosted service – at some point you may be
interested in how well your server or your users are doing. Information such as hit
frequency, geolocation of users and distribution of spent bandwidth is very useful for this and can
be obtained in different ways:</p>
<ul>
<li>by instrumenting the page running inside the client browser (eg piwik)</li>
<li>by analysis of the web server logs (eg webalizer)</li>
</ul>
<p>For the latter I have used webalizer for several years, which produces nice web-based analysis
plots. More recently I moved to a more complicated server environment with several virtual web
services and I found the configuration and data selection options a bit limiting. Hence I started, as
a toy project, to implement the same functionality with a set of simple R scripts, which I will
progressively share here.</p>
<p>As a first step, here are some simple examples for the data import, cleaning and overview plots. We’ll then
add asynchronous IP resolution, add and analyse geolocation information and as a last step wrap the
analysis output tables and plots into a web application, which can be consulted from a remote
browser.</p>
<h2 id="data-dot-table-vs-dot-dplyr">data.table vs. dplyr</h2>
<p>One of my favourite R packages for data handling, which I will also use here, is the <code>data.table</code>
package. Note: Most of the results can be obtained in a similar way using the excellent <code>dplyr</code>
package, but in some other (larger volume) studies <code>data.table</code> had some performance and
memory efficiency advantages, so I’ll stick to it here as well. If you are using <code>R</code> for data
handling/aggregating and are not familiar with either package – take a look at both and make your
own choice.</p>
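<p>To make the comparison a bit more concrete, here is a minimal sketch of the same aggregation in both
packages. The toy table and its values are made up for illustration, and both packages are assumed to be
installed:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R">library(data.table)
library(dplyr)

## toy stand-in for a weblog table (hypothetical values)
toy &lt;- data.table(status = c(200, 200, 404), bytes = c(100, 50, 10))

## data.table: hits and transferred bytes per status code
dt.agg &lt;- toy[, .(hits = .N, bytes = sum(bytes)), by = status]

## dplyr: the same aggregation written as a pipeline on a plain data frame
dp.agg &lt;- as.data.frame(toy) %&gt;%
  group_by(status) %&gt;%
  summarise(hits = n(), bytes = sum(bytes))

print(dt.agg)
print(dp.agg)
</code></pre></div>
<p>Both give one row per status code; which syntax reads better is largely a matter of taste.</p>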
<h2 id="importing-the-logs-into-r">Importing the logs into R</h2>
<p>Well, this part is rather simple since apache logs can be read via the standard read.table function:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span><span style="color:#a6e22e">library</span>(data.table)
</span></span><span style="display:flex;"><span><span style="color:#75715e">## read the complete log - your file name is likely different</span>
</span></span><span style="display:flex;"><span>w <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.table</span>(<span style="color:#a6e22e">read.table</span>(<span style="color:#e6db74">&#34;/var/log/apache2/access_log&#34;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## there are a few different log types which vary in the number and sequence</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## of log items. Have a look at the apache configuration or just the file.</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## In my case I get a so called &#39;combinedvhost&#39; file which lists as first</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## field the website (out of several virtual sites on the same server)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## and as second field the client host which accessed the server.</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## There is a good chance that your server config omits the first field,</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## so you may try to drop the &#39;vhost&#39; string below.</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">setnames</span>(w,<span style="color:#a6e22e">c</span>(<span style="color:#e6db74">&#39;vhost&#39;</span>,<span style="color:#e6db74">&#39;host&#39;</span>,<span style="color:#e6db74">&#39;ident&#39;</span>,<span style="color:#e6db74">&#39;authuser&#39;</span>,<span style="color:#e6db74">&#39;date&#39;</span>,<span style="color:#e6db74">&#39;tz&#39;</span>,<span style="color:#e6db74">&#39;request&#39;</span>,<span style="color:#e6db74">&#39;status&#39;</span>,<span style="color:#e6db74">&#39;bytes&#39;</span>,<span style="color:#e6db74">&#39;refer&#39;</span>,<span style="color:#e6db74">&#39;agent&#39;</span>))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## try the following command to see if data and field names match:</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">summary</span>(w)
</span></span><span style="display:flex;"><span><span style="color:#75715e">## btw: already this summary shows a lot of interesting info</span>
</span></span></code></pre></div>]]></content:encoded>
    </item>
    <item>
      <title>Getting hold of remote weblogs</title>
      <link>https://analytx.info/2015/01/13/getting-hold-of-remote-weblogs/</link>
      <pubDate>Tue, 13 Jan 2015 01:25:00 +0100</pubDate>
      <guid>https://analytx.info/2015/01/13/getting-hold-of-remote-weblogs/</guid>
      <description>Extending R weblog analysis to handle remote log files — using pipes for filtering and SSH-accessible logs from a remote machine.</description>
      <content:encoded><![CDATA[<p>The previous post assumed that the weblogs to analyse are directly accessible
by the R session, which may not be the case if your analysis is running on a
remote machine. Also in some cases you may want to filter out some
uninteresting log records (eg local clients on the web server or local area
accesses from known clients). The next examples show how to modify the
previous R script using the R pipe function to take this into account:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span><span style="color:#75715e">## read the last 100K log entries from svr via a ssh connection</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## (this assumes you have setup the ssh keys correctly beforehand)</span>
</span></span><span style="display:flex;"><span>w <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.table</span>(<span style="color:#a6e22e">read.table</span>(<span style="color:#a6e22e">pipe</span>(<span style="color:#e6db74">&#34;ssh svr &#39;tail -n 100000 /var/log/apache2/access_log&#39;&#34;</span>)))
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## in addition filter out all accesses from local clients on the web</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## server or the local subnet (in this case 192.168.10.xxx)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>w <span style="color:#f92672">&lt;-</span> <span style="color:#a6e22e">data.table</span>(<span style="color:#a6e22e">read.table</span>(<span style="color:#a6e22e">pipe</span>(<span style="color:#e6db74">&#34;ssh svr &#39;tail -n 100000 /var/log/apache2/access_log | awk \&#34;\\$2 !~ /127\\.0\\.0\\.1|192\\.168\\.10\\./\&#34;&#39;&#34;</span>)))
</span></span><span style="display:flex;"><span><span style="color:#75715e">## note: the proper quoting/escaping of R and shell strings on this one takes</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e">## more effort than the processing. There must be an R function which does this...</span>
</span></span></code></pre></div><p>In a similar way you could concatenate multiple (eg already logrotated) logs
and/or unzip logfiles. As this pre-filtering takes place on the server machine
holding the log files, it helps to bring down the amount of data to be transferred
and analysed: always a good start to avoid the popular &lsquo;unnecessarily big data&rsquo;
syndrome&hellip;</p>
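<p>The quoting problem noted in the code comment above can be eased with base R&rsquo;s <code>shQuote()</code>.
The following is only a sketch: the helper name <code>remote.log.cmd</code> and the rotated file names are
made up for illustration, and it assumes GNU <code>zcat</code> on the server, where <code>zcat -f</code>
passes uncompressed files through unchanged:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R">## build the ssh command with shQuote() instead of escaping by hand
## (remote.log.cmd and the file names below are illustrative assumptions)
remote.log.cmd &lt;- function(host, files,
                           exclude = "127\\.0\\.0\\.1|192\\.168\\.10\\.") {
  ## awk program: keep records whose second field does not match the pattern
  awk.prog &lt;- paste0("$2 !~ /", exclude, "/")
  ## command for the server: concatenate (and gunzip where needed), then filter
  remote &lt;- paste("zcat -f", paste(shQuote(files, type = "sh"), collapse = " "),
                  "| awk", shQuote(awk.prog, type = "sh"))
  ## quote the complete remote command once more for the local shell
  paste("ssh", host, shQuote(remote, type = "sh"))
}

## list the files oldest first to keep the records in time order
cmd &lt;- remote.log.cmd("svr", c("/var/log/apache2/access_log.2.gz",
                               "/var/log/apache2/access_log.1",
                               "/var/log/apache2/access_log"))
cat(cmd, "\n")
## then read as before: w &lt;- data.table(read.table(pipe(cmd)))
</code></pre></div>
<p>Each <code>shQuote()</code> call handles one shell level, so the awk program is quoted for the remote shell and
the whole remote command is quoted again for the local one.</p>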
]]></content:encoded>
    </item>
    <item>
      <title>Hadoop &#43; kerberos with RStudio</title>
      <link>https://analytx.info/2014/12/02/hadoop--kerberos-with-rstudio/</link>
      <pubDate>Tue, 02 Dec 2014 18:23:00 +0100</pubDate>
      <guid>https://analytx.info/2014/12/02/hadoop--kerberos-with-rstudio/</guid>
      <description>How to configure RStudio and RHadoop to access a Kerberos-secured Hadoop cluster from macOS, including the GUI environment variable fix.</description>
      <content:encoded><![CDATA[<p>While the setup from the previous posts works for the hadoop shell commands, you will
still fail to access the remote cluster from GUI programs (eg <a href="http://www.rstudio.com/products/rstudio/">RStudio</a>) and/or
with hadoop plugins like <a href="https://github.com/RevolutionAnalytics/RHadoop/wiki">RHadoop</a>.</p>
<p>There are two reasons for that:</p>
<ul>
<li>GUI programs do not inherit your terminal/shell environment variables -
unless you start them from a terminal session with</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span> $ open /Applications/RStudio.app
</span></span></code></pre></div><ul>
<li>$HADOOP_OPTS / $YARN_OPTS are not evaluated by other programs even
if the variables are present in their execution environment.</li>
</ul>
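<p>To see whether the first point applies to you, a small check inside the R session can help. This is a
minimal sketch; <code>missing.env</code> is a hypothetical helper name, not part of any package:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R">## which of the given variables are NOT set in this R session?
## (missing.env is a hypothetical helper name)
missing.env &lt;- function(vars) {
  vars[is.na(Sys.getenv(vars, unset = NA))]
}

## run this once in R started from a terminal and once inside RStudio
missing.env(c("HADOOP_OPTS", "YARN_OPTS"))
</code></pre></div>
<p>If the variables show up as missing only in the GUI session, the environment was not inherited.</p>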
<p>The first problem is well covered by various blog posts. The main difficulty
is finding the correct procedure for your OSX version, since Apple has changed it
several times over the years:</p>
<ul>
<li>using a .plist file in <code>~/.MacOSX</code> (before Mavericks)</li>
<li>using a setenv statement in <code>/etc/launchd.conf</code> (Mavericks)</li>
<li>using the <code>launchctl setenv</code> command (from Yosemite)</li>
</ul>
<p>Finding out which variable is used inside your GUI program or plugin may need
some experimentation or a look at the source. For java based plugins the variable
<code>_JAVA_OPTIONS</code>, which is always evaluated, may be a starting point. For the RHadoop
package the more specific HADOOP_OPTS is already sufficient, so on Yosemite:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ launchctl setenv HADOOP_OPTS <span style="color:#e6db74">&#34;-Djava.security.krb5.conf=/etc/krb5.conf&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># prefix command with sudo in case you want the setting for all users</span>
</span></span></code></pre></div><p>If you need the setting only inside R/RStudio you could simply add the
environment setting in your R scripts before initialising the RHadoop packages.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-R" data-lang="R"><span style="display:flex;"><span><span style="color:#75715e"># wrapper script:  hadoop --config ~/remote-hadoop-conf</span>
</span></span><span style="display:flex;"><span>hadoop.command <span style="color:#f92672">&lt;-</span> <span style="color:#e6db74">&#34;~/scripts/remote-hadoop&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">Sys.setenv</span>(HADOOP_OPTS <span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Djava.security.krb5.conf=/etc/krb5.conf&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">Sys.setenv</span>(HADOOP_CMD<span style="color:#f92672">=</span>hadoop.command)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># load hdfs plugin for R</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">library</span>(rhdfs)
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">hdfs.init</span>()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># print remote hdfs root directory</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">print</span>(<span style="color:#a6e22e">hdfs.ls</span>(<span style="color:#e6db74">&#34;/&#34;</span>))
</span></span></code></pre></div>]]></content:encoded>
    </item>
    <item>
      <title>Connect to a remote, kerberized hadoop cluster</title>
      <link>https://analytx.info/2014/12/02/connect-to-a-remote-kerberized-hadoop-cluster/</link>
      <pubDate>Tue, 02 Dec 2014 10:28:00 +0100</pubDate>
      <guid>https://analytx.info/2014/12/02/connect-to-a-remote-kerberized-hadoop-cluster/</guid>
      <description>Step-by-step guide to connecting a macOS Hadoop client to a remote Kerberos-authenticated cluster, including krb5.conf setup and HDFS access.</description>
      <content:encoded><![CDATA[<p>To use a remote hadoop cluster with kerberos authentication you will need to
get a proper <code>krb5.conf</code> file (eg from <code>/etc/krb5.conf</code> on your remote cluster)
and place it as <code>/etc/krb5.conf</code> on your client OSX machine. To use this
configuration from your OSX hadoop client, add the following to your <code>.[z]profile</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>export HADOOP_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Djava.security.krb5.conf=/etc/krb5.conf&#34;</span>
</span></span><span style="display:flex;"><span>export YARN_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Djava.security.krb5.conf=/etc/krb5.conf&#34;</span>
</span></span></code></pre></div><p>With java 1.7 this should be sufficient to detect the default realm, the kdc
and also any specific authentication options used by your site. Please make sure
the kerberos configuration is already in place when you obtain your ticket with</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ kinit
</span></span></code></pre></div><p>In case you obtained a ticket beforehand, you may have to execute <code>kinit</code> again
or log in to your local account again.</p>
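<p>To check that a valid ticket is in place, the standard kerberos <code>klist</code> command
shows the content of your credential cache:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e">## list the tickets currently held (principal and expiry time)</span>
</span></span><span style="display:flex;"><span>$ klist
</span></span></code></pre></div>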
<p>For the next step you will need to obtain the remote cluster configuration files
(eg by copying them with <code>scp</code> from the remote cluster to a local directory such as
<code>~/remote-hadoop-conf</code>). The result should be a local copy similar to this:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>$ ls -l  ~/remote-hadoop-conf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>total 184
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  4146 Jun 25  2013 capacity-scheduler.xml
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  4381 Oct 21 11:44 core-site.xml
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff   253 Aug 21 11:46 dfs.includes
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff     0 Jun 25  2013 excludes
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff   896 Dec  1 11:44 hadoop-env.sh
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  3251 Aug  5 09:50 hadoop-metrics.properties
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  4214 Oct  7  2013 hadoop-policy.xml
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  7283 Nov  3 16:44 hdfs-site.xml
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  8713 Nov 18 16:26 log4j.properties
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  6112 Nov  5 16:52 mapred-site.xml
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff   253 Aug 21 11:46 mapred.includes
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff   127 Apr  4  2014 taskcontroller.cfg
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff   931 Oct 20 09:44 topology.table.file
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff    70 Jul  2 11:52 yarn-env.sh
</span></span><span style="display:flex;"><span>-rw-r--r--  1 dirkd  staff  5559 Nov  5 16:52 yarn-site.xml
</span></span></code></pre></div><p>Then point your hadoop and hdfs commands to this
configuration:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ hdfs --config ~/remote-hadoop-conf dfs -ls /
</span></span></code></pre></div><p>If all worked well, you should now see the content of the remote
hdfs root directory and be ready to use the standard hdfs or hadoop
commands remotely.</p>
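<p>To avoid typing the <code>--config</code> option every time, you could define shortcuts in
your <code>.[z]profile</code>. This is only a minimal sketch: the alias names are made up for
illustration, and the directory is assumed to be the local copy created above.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># hypothetical alias names; adjust the path to your local config copy</span>
</span></span><span style="display:flex;"><span>alias remote-hadoop<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;hadoop --config ~/remote-hadoop-conf&#39;</span>
</span></span><span style="display:flex;"><span>alias remote-hdfs<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;hdfs --config ~/remote-hadoop-conf&#39;</span>
</span></span></code></pre></div><p>With these in place, <code>remote-hdfs dfs -ls /</code> in an interactive shell should behave like
the explicit <code>--config</code> call, while plain <code>hdfs</code> keeps using your default configuration.</p>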
]]></content:encoded>
    </item>
    <item>
      <title>Basic set-up of hadoop on OSX yosemite</title>
      <link>https://analytx.info/2014/11/18/basic-set-up-of-hadoop-on-osx-yosemite/</link>
      <pubDate>Tue, 18 Nov 2014 21:49:00 +0100</pubDate>
      <guid>https://analytx.info/2014/11/18/basic-set-up-of-hadoop-on-osx-yosemite/</guid>
      <description>Complete guide to installing and configuring Apache Hadoop on macOS Yosemite using Homebrew. Includes HDFS setup, troubleshooting Kerberos issues, and preparing for remote cluster access.</description>
      <content:encoded><![CDATA[<h2 id="why-would-i-do-this">Why would I do this?</h2>
<p>An OSX laptop will not allow you to do any larger scale data processing, but it may
be a convenient place to develop/debug hadoop scripts before running them on a real
cluster. For this you likely want to have a local hadoop “cluster” to play
with, and use the local commands as a client for a larger remote hadoop
cluster. This post covers the local install and basic testing. A second post
shows how to extend the setup for accessing / processing against a remote
kerberized cluster.</p>
<h2 id="getting-prepared">Getting prepared</h2>
<p>If you don’t yet have java (yosemite does not actually come with it) then the
first step is to download the installer from the Oracle <a href="https://www.java.com/en/download/index.jsp">download site</a>. Once
installed you should see something like this in a terminal shell:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ java -version
</span></span><span style="display:flex;"><span>java version <span style="color:#e6db74">&#34;1.7.0_45&#34;</span>
</span></span><span style="display:flex;"><span>Java<span style="color:#f92672">(</span>TM<span style="color:#f92672">)</span> SE Runtime Environment <span style="color:#f92672">(</span>build 1.7.0_45-b18<span style="color:#f92672">)</span>
</span></span><span style="display:flex;"><span>Java HotSpot<span style="color:#f92672">(</span>TM<span style="color:#f92672">)</span> 64-Bit Server VM <span style="color:#f92672">(</span>build 24.45-b08, mixed mode<span style="color:#f92672">)</span>
</span></span></code></pre></div><p>If you need to have several java versions installed and want to be able to
switch between them: have a look at the nice description <a href="http://java.dzone.com/articles/multiple-versions-java-os-x">here</a>.</p>
<p>If you don’t yet have the homebrew package manager installed then get it now
by following the (one line) installation on <a href="http://brew.sh">http://brew.sh</a>. Homebrew packages
live in <code>/usr/local</code> and rarely interfere with other stuff on your machine
(unless you ask them to). Install the hadoop package as a normal user using:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ brew install hadoop
</span></span></code></pre></div><p>(At the time of writing I got hadoop 2.5.1)</p>
<p>BTW: Once you start using brew also for other packages, be careful when
using <code>brew upgrade</code>. Eg you may want to use <code>brew pin</code> to avoid getting a new
hadoop version installed while doing other package upgrades.</p>
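<p>As a short sketch, pinning and later releasing the hadoop formula could look like this
(the formula name <code>hadoop</code> is the one installed above):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e">## keep the currently installed hadoop version during upgrades</span>
</span></span><span style="display:flex;"><span>$ brew pin hadoop
</span></span><span style="display:flex;"><span><span style="color:#75715e">## show which formulae are currently pinned</span>
</span></span><span style="display:flex;"><span>$ brew list --pinned
</span></span><span style="display:flex;"><span><span style="color:#75715e">## allow hadoop to be upgraded again</span>
</span></span><span style="display:flex;"><span>$ brew unpin hadoop
</span></span></code></pre></div>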
<h2 id="configure">Configure</h2>
<p>Next stop: edit a few config files. In <code>.[z]profile</code> you may want to add a few
shortcuts to quickly jump to the relevant places or to be able to switch
between hadoop and java versions, but this is not strictly required to run hadoop.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>export JAVA_HOME<span style="color:#f92672">=</span><span style="color:#66d9ef">$(</span>/usr/libexec/java_home<span style="color:#66d9ef">)</span>
</span></span><span style="display:flex;"><span>export HADOOP_VERSION<span style="color:#f92672">=</span>2.5.1
</span></span><span style="display:flex;"><span>export HADOOP_BASE<span style="color:#f92672">=</span>/usr/local/Cellar/hadoop/<span style="color:#e6db74">${</span>HADOOP_VERSION<span style="color:#e6db74">}</span>
</span></span><span style="display:flex;"><span>export HADOOP_HOME<span style="color:#f92672">=</span>$HADOOP_BASE/libexec
</span></span></code></pre></div><p>Now you should edit a few hadoop files in your hadoop configuration directory:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>cd $HADOOP_HOME/etc/hadoop
</span></span></code></pre></div><p>In <code>core-site.xml</code> expand the configuration to:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;configuration&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>hadoop.tmp.dir<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>/usr/local/Cellar/hadoop/hdfs/tmp<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;description&gt;</span>A base for other temporary directories.<span style="color:#f92672">&lt;/description&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>fs.default.name<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>hdfs://localhost:9000<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/configuration&gt;</span>
</span></span></code></pre></div><p>In <code>hdfs-site.xml</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;configuration&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>dfs.replication<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>1<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/configuration&gt;</span>
</span></span></code></pre></div><p>and finally in <code>mapred-site.xml</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#f92672">&lt;configuration&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;property&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;name&gt;</span>mapred.job.tracker<span style="color:#f92672">&lt;/name&gt;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&lt;value&gt;</span>localhost:9010<span style="color:#f92672">&lt;/value&gt;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&lt;/property&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">&lt;/configuration&gt;</span>
</span></span></code></pre></div><p>Now it’s time to:</p>
<ul>
<li>Initialise hdfs</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ hadoop namenode -format
</span></span></code></pre></div><ul>
<li>Start hdfs and yarn</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e">## start the hadoop daemons (move to launchd plist to do this automatically)</span>
</span></span><span style="display:flex;"><span>$ $HADOOP_BASE/sbin/start-dfs.sh
</span></span><span style="display:flex;"><span>$ $HADOOP_BASE/sbin/start-yarn.sh
</span></span></code></pre></div><ul>
<li>Test your hdfs setup</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e">## show (still empty) homedir in hdfs</span>
</span></span><span style="display:flex;"><span>$ hdfs dfs -ls
</span></span><span style="display:flex;"><span><span style="color:#75715e">## put some local file</span>
</span></span><span style="display:flex;"><span>$ hdfs dfs -put myfile.txt
</span></span><span style="display:flex;"><span><span style="color:#75715e">## now we should see the new file</span>
</span></span><span style="display:flex;"><span>$ hdfs dfs -ls
</span></span></code></pre></div><h2 id="work-around-an-annoying-kerberos-realm-problem-on-osx">Work around an annoying Kerberos realm problem on OSX</h2>
<p>The hadoop setup will at this point likely still complain with a message
like <code>Unable to load realm info from SCDynamicStore</code>, which is caused by a java
bug on OSX (more <a href="http://mail.openjdk.java.net/pipermail/macosx-port-dev/2013-March/005443.html">details here</a>).</p>
<p>There are different ways to work around this, depending on whether you just
want to get a local hadoop installation going or need your hadoop client to
(also) access a remote kerberized hadoop cluster.</p>
<p>To get java running on the local (non-kerberized) setup, it is
sufficient to just add some definitions to <code>$HADOOP_OPTS</code> (and <code>$YARN_OPTS</code> for
yarn) in <code>.[z]profile</code> as described in <a href="http://stackoverflow.com/questions/7134723/hadoop-on-osx-unable-to-load-realm-info-from-scdynamicstore">this post</a>.</p>
<p>The actual hostname probably does not matter too much, as you won’t do an
actual kerberos exchange locally, but just get past the flawed
“do we know a default realm” check in java.</p>
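<p>As a rough sketch of such definitions (not copied from the linked post):
<code>java.security.krb5.realm</code> and <code>java.security.krb5.kdc</code> are the standard java system
properties for the default realm and kdc, while the realm and hostname below are
placeholders assumed for illustration only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># placeholder realm and kdc host, only meant to satisfy the default realm check</span>
</span></span><span style="display:flex;"><span>export HADOOP_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Djava.security.krb5.realm=EXAMPLE.COM -Djava.security.krb5.kdc=kdc.example.com&#34;</span>
</span></span><span style="display:flex;"><span>export YARN_OPTS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-Djava.security.krb5.realm=EXAMPLE.COM -Djava.security.krb5.kdc=kdc.example.com&#34;</span>
</span></span></code></pre></div>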
<p>In case you are planning to access a kerberized hadoop cluster
please continue reading the next post.</p>
<h2 id="cleaning-up">Cleaning up</h2>
<p>Some of the default logging settings make hadoop rather chatty on the console
about deprecated configuration keys and other things. On OSX a few of these
messages become annoying after a while, as they make it harder to spot real
problems. You may want to adjust the <code>log4j</code> settings to mute warnings that you
don’t want to see every single time you enter a hadoop command. In
<code>$HADOOP_HOME/etc/hadoop/log4j.properties</code> you could add:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-properties" data-lang="properties"><span style="display:flex;"><span><span style="color:#75715e"># Logging Threshold</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">log4j.threshold</span><span style="color:#f92672">=</span><span style="color:#e6db74">ALL</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># the native libs don&#39;t exist for OSX</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">log4j.logger.org.apache.hadoop.util.NativeCodeLoader</span><span style="color:#f92672">=</span><span style="color:#e6db74">ERROR</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># yes, we&#39;ll keep in mind that some things are deprecated</span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">log4j.logger.org.apache.hadoop.conf.Configuration.deprecation</span><span style="color:#f92672">=</span><span style="color:#e6db74">ERROR</span>
</span></span></code></pre></div>]]></content:encoded>
    </item>
  </channel>
</rss>
