DevOps for Hadoop
By Bob Aiello
Apache Hadoop is a framework that enables the analysis of large datasets (or what some folks are calling “Big Data”), using clusters of computers or cloud-based “elastic” resources offered by most cloud-based providers. Hadoop is designed to scale up from a single server to thousands of machines, allowing you to start with a simple installation and scale to a huge implementation capable of processing petabytes of structured (or even unstructured) complex data. Hadoop has many moving parts and is both elegant in its simplicity and challenging in its potential complexity. This article explores some of the essential concerns that the DevOps practitioner needs to consider in automating the systems and applications build, package and deployment of components in the Hadoop framework.
For this article, I read several books  and implemented a Hadoop sandbox using both Hortonworks as well as Cloudera. In some ways, I found the technology to be very familiar as I have supported very large Java-based trading systems, which were deployed in web containers from Tomcat to Websphere. At the same time, I was humbled by the sheer number of components and dependencies that must be understood by the Hadoop administrator as well as the DevOps practitioner. In short, I am kid in a candy store with Hadoop, come join me as we sample everything from the lollipops to the chocolates 🙂
To begin learning Hadoop you can set up a single node cluster. This configuration is adequate for simple queries and more than functional for learning about Hadoop and then scaling to a cluster setup when you are ready to implement your production environment. I found the Hortonworks sandbox to be particularly easy to implement using virtual box (although almost everything that I do these days is on docker containers). Both Cloudera Manager and Apache Ambari are administration consoles that help to deploy and manage the hadoop framework.
I used Ambari which has features that help provision, manage, and monitor Hadoop clusters, including supporting the Hadoop Distributed File System (HDFS), Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. I used the Ambari dashboard which helps to view cluster health (including heatmaps) and also provides the ability to view MapReduce, Pig and Hive applications including performance and other resources. HDFS can also be supported on Amazon Simple Storage Service (S3) buckets, Azure blobs and OpenStack Object Storage (Swift) .
One of the most interesting things about Hadoop is that you can use “commodity hardware” which does not necessarily mean cheap, but does mean that you use the servers that you are able to obtain and support and they do not all have to be the exact same type of machine. Obviously, the DevOps practitioner will need to pay attention to the requirements for provisioning, and especially supporting, the many required configuration files. This is where configuration tools including Chef, Puppet, CFEngine, Bcfg2 and SaltStack can be especially helpful, although I am finding myself migrating towards using Ansible for configuration management as well as both infrastructure and application deployments.
Logfiles, as well as the many environment settings, need to be monitored. The administration console provides a wide array of alerts which can identify potential resource and operational issues before there is end-user impact. The Hadoop framework has daemon processes running, each of which require one or more specific open ports for communication.
Anyone who reads my books and articles knows that I emphasize processes and procedures which ensure that you can prove that you have the right code in production, detect unauthorized changes and have the code “self-heal”, through returning to a known baseline (obviously while adhering to change control and other regulatory requirements). Monitoring baselines in a complex java environment utilizing web containers including Tomcat, jBoss and WebSphere can be very difficult especially because there are so many files which are dynamically changing and therefore should not be monitored. Identifying the files which should be monitored (using cryptographic hashes including MAC SHA1 and MD5) can take some work and should be put in place from the very beginning of the development lifecycle in all environments from development test to production.
In fact getting Hadoop to work in development and QA testing environments does take some effort, giving you an opportunity to start working on your production deployment and monitoring procedures. The lessons learned while setting up the lower environments (e.g. development test) can help you begin building the automation to support your production environments. I had a little trouble getting the Cloudera framework to run successfully using docker containers, but I am confident that I will get this to work in the coming days. Ultimately, I would like to use docker containers to support Hadoop – mostly likely with Kubernetes for orchestration. You can also run Hadoop on hosted services such as AWS EMR or use AWS EC2 instances (which could get pretty expensive). Ultimately, you want to run a lean environment that has the capability of scaling to meet peek usage needs, but can also scale down when all of those expensive resources are not needed.
Hadoop is a pretty complex system with many components and as complicated as the DevOps requirements are, I cannot imagine how impossible it would be to manage a production Hadoop environment without DevOps principles and practices. I am interested in hearing your views on best practices around supporting Hadoop in production as well as other complex systems. Drop me a line to share your best practices!
 Hadoop the Definitive Guide, by Tom White, O’Reilly Media; 4 edition, 2015
 Pro Apache Hadoop, by Jason Venner et al, Apress; 2nd ed. edition, 2014
 Aiello, Robert and Leslie Sachs. Configuration Management Best Practices: Practical Methods that Work in the Real World. Addison-Wesley, 2010
 Aiello, Robert and Leslie Sachs. Agile Application Lifecycle Management: Using DevOps to Drive Process Improvement, Addison-Wesley, 2016