Google Compute Engine Hadoop clusters with zdutil

Here at zulily, we use Google Compute Engine (GCE) for running our Hadoop clusters. Google has a utility called bdutil for setting up and tearing down Hadoop clusters on GCE. We ran into a number of issues when using the utility and were using an internally patched version of it to create our Hadoop clusters. If you look at the source, bdutil is essentially a collection of bash scripts that automate the various steps of creating a GCE instance and provisioning it with all the necessary software needed to run Hadoop. One major issue we found with bdutil was that there is no way to provision a Hadoop cluster where the datanodes do not have external IP addresses. For clusters with many datanodes — the kind we typically run — this means we end up running against our quota of external IP addresses. Additionally, there is no reason for the datanodes to have external IP addresses as they should not be accessible to the public.

We decided to stop patching bdutil and write our own utility to provision a Hadoop cluster. The utility is called zdutil and you can find it on our GitHub page. Here’s how it works:

  • First, GCE instances are created for the namenode and all datanodes in your Hadoop cluster.
  • Then, any persistent disks that you requested are created and attached to the instances.
  • If you have have any tags that you would like to be applied to the namenode or datanodes, the tags are added to the instances. This saves you from having to manually tag every single instance in your cluster or write your own script to do so.
  • Next, all of the required setup scripts to provision the namenode and datanodes are copied to a GCS bucket of your choosing. The namenode then provisions itself.
  • Once it completes, it copies (via scp) all scripts needed for datanode provisioning to each datanode and then each datanode will provision itself.
  • Once all datanodes have been provisioned, the namenode will start the Hadoop cluster.

If you deploy the datanodes with either external or ephemeral IP addresses, they will have internet access as determined by the rules of your GCE network. If you deploy the datanodes with “none” for the IP address, they will proxy through the namenode using Squid. You don’t have to configure any of this yourself; zdutil will take care of the details for you, including installing and provisioning Squid on your namenode. It is also important to be aware that Google’s version of the Google Cloud Storage Connector currently does not support proxying. If you use zdutil, it will install our fork of the GCS Connector which does support proxying by adding the following properties to your Hadoop core-site.xml configuration file: fs.gs.proxy.host and fs.gs.proxy.port.

If you have any need for zdutil, please use it and give us your feedback. At the moment we only support Debian-based images and we only support Hadoop version 1. If you would like to see another OS supported or Yarn support, please add an issue to the GitHub page.