Install Hadoop & HDFS on OS X 10.11

Below is a simple guide on how to set up Hadoop and HDFS on OS X El Capitan. I’ve taken bits and pieces of guides (cited) to demonstrate that this setup works with El Capitan. I haven’t yet upgraded to Sierra to check whether the prescribed methods work, so proceed at your own risk. Make sure to comment and message me if I have left any details out!

Install Brew and Hadoop

Homebrew makes installing and managing Hadoop simple on OS X. If you haven’t installed or used it before, I highly recommend it. I use brew regularly to install a variety of other tools! To proceed to the next step, complete the following tasks, outlined in the gist below.

  • install brew
  • install hadoop with brew
  • note hadoop version number
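A minimal sketch of those steps; the first command is the Homebrew install one-liner published at brew.sh at the time of writing:

# install brew (command from https://brew.sh)
$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

# install hadoop with brew
$ brew install hadoop

# note the hadoop version number (2.7.3 in my case)
$ brew list --versions hadoop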

Modify Hadoop Files

Brew installs these libraries in the /usr/local/Cellar directory of your file system. The files we need to modify to enable Hadoop to run in pseudo-distributed mode are found in /usr/local/Cellar/hadoop/<version>/libexec/etc/hadoop/. Replace the <version> string with the version of hadoop you installed from brew. I installed hadoop 2.7.3, so my path to the files is /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/. In this directory, core-site.xml and hdfs-site.xml need to be edited, while mapred-site.xml needs to be created. The modified and added files are described in the gists below.
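A minimal sketch of the three files, following the configuration from the official Single Node Cluster guide; the mapred-site.xml value assumes that guide's YARN-based setup:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>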

See Hadoop: Setting up a Single Node Cluster for more information on the setup of these files.

Starting Hadoop and HDFS

Lastly, the following information needs to be appended to the environment file of your respective shell. I’m using zsh, so I append this code to my .zshrc file.
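A sketch of that block; the hstart/hstop alias names are my own choice, and the HADOOP_HOME path matches the brew install location noted earlier:

# point HADOOP_HOME at the brew-installed libexec directory
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec

# aliases for starting and stopping the HDFS and YARN daemons
alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-yarn.sh;$HADOOP_HOME/sbin/stop-dfs.sh"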

After this code has been appended to the environment file, we simply need to source the file to load the aliases in our running shell.

# ex: $ source ~/.zshrc
$ source ~/.<file>
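With the aliases loaded, format HDFS once and start the daemons; hstart is the alias name assumed in the sketch above:

# format the namenode before the very first start (per the official guide)
$ hdfs namenode -format

# start the HDFS and YARN daemons
$ hstart

# list the running Java processes to confirm NameNode, DataNode, etc. are up
$ jps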

Continuing w/ Guide

The Single Node Cluster guide can be followed for further instructions that are not OS X or Homebrew specific. Note that after installing hadoop via brew, you can access the hadoop and hdfs commands from your command line shell w/o using bin/hdfs.
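For example, both of these run from any directory, because brew links the executables into /usr/local/bin:

$ hadoop version
$ hdfs dfs -ls /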

Testing with Spark

In the near future I’m going to write a post about setting up Jupyter Notebook and Spark locally. This section is entirely optional, as I use reading from HDFS as a simple test of my installation. For simplicity, I downloaded the linkage dataset from UCI (available here).

After continuing the setup of HDFS and Hadoop and unzipping each of the blocks, we can push the local csv files to HDFS using the following commands.
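A sketch of those commands, assuming the unzipped blocks are named block_*.csv and using /linkage as the target directory (both my choices):

# make a directory in HDFS for the dataset
$ hdfs dfs -mkdir -p /linkage

# push the local csv files into HDFS
$ hdfs dfs -put block_*.csv /linkage

# confirm the files are there
$ hdfs dfs -ls /linkage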

Lastly, I run the pyspark code below to read the data from HDFS and print out 10 results.
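A minimal sketch of that code, assuming Spark 2.x and the /linkage path used above; the port matches the fs.defaultFS value in core-site.xml:

from pyspark.sql import SparkSession

# start a local Spark session for the smoke test
spark = SparkSession.builder.appName("hdfs-smoke-test").getOrCreate()

# read the linkage csv files from HDFS
df = spark.read.csv("hdfs://localhost:9000/linkage/block_*.csv", header=True)

# print out 10 results
df.show(10)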

You're awesome for taking time out of your day to read this! Please consider sharing below!