Preparing to install Cloudera CDH

by Alex McLintock and Alan Duval of Alephant.co.uk, Sept 2017

 

This page describes the steps you need to go through to prepare some machines for installing Cloudera Manager and CDH. This has been written and tested assuming you are installing Cloudera Manager/CDH 5.12 on a cluster of machines with clean installs of CentOS 7.3.

If you haven't already, you may want to read the introduction to this series: Installing Cloudera Hadoop on CentOS/RedHat

If you have already undertaken the necessary preparation, and are just looking for additional information on the installation of Cloudera Manager/CDH, you can jump to: Doing the Install: Cloudera CDH5.12 on CentOS7.3

 

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-1-Intro

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-3-Install

 

Hardware

In this article we are assuming that we are installing on five machines, which we are calling fleet01 to fleet05. The first two machines are considered controllers or masters (and as such may have RAID or SSD). They will have the most memory, and less disk space, as compared to the other machines (assuming there is any difference at all). The other machines are considered workers and they do the bulk of the work. When we want to add capacity later on we are typically adding new worker nodes.

Note - There is little point in installing Hadoop on fewer than five machines.

 

Nodes

On the worker nodes we want individual disks to be mounted separately (e.g. in /mnt/disk01, /mnt/disk02, /mnt/disk03) and NOT all merged together with LVM or some other tool.
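
As a rough check that each data disk really is mounted on its own, something like the following sketch can list the data mounts on a worker node. It assumes the /mnt/diskNN naming used in the example above; the function name list_data_mounts is ours, not part of any tool.

```shell
# List data-disk mount points from a mounts table (default: /proc/mounts).
# Assumption: data disks are mounted as /mnt/diskNN, as in the example above.
list_data_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Field 1 is the device, field 2 is the mount point.
    awk '$2 ~ /^\/mnt\/disk[0-9]+$/ { print $2, $1 }' "$mounts_file" | sort
}

# Usage on a worker node:
# list_data_mounts
```

Each mount point should appear once, each with a different device next to it.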

On the master/controller nodes we can utilise whatever RAID is available. 

 

? - I am not sure what file systems are preferable. I suggest ext3 for worker nodes - though you may wish to do some research. Something like ZFS may be suitable for the controller nodes - but I am not aware of anyone who uses it that way.

 

Quick Start VMs should not be used

The Quickstart system is a "fake" single machine cluster built into a virtual machine image. (Hortonworks calls this a sandbox; Cloudera calls it Quickstart.) It is not really helpful in creating a new Hadoop cluster. Don't use the Quickstart VMs for anything other than a learning or demonstration system on one machine.

 

Software Sources

There are a number of different ways to get the relevant software onto your machines. If you have previously done a Cloudera install using tarballs, note that Cloudera have deprecated that method, as of Cloudera Manager 5.9.0, and it will be unavailable from version 6.0 (detailed in Installation Path C - Manual Installation Using Cloudera Manager Tarballs).

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_path_c.html

 

Do we use Cloudera repositories or a Local Yum Software repository?

On CentOS/RedHat systems software is normally installed from repositories. Cloudera run at least two repositories for their software - one for CDH and one for Cloudera Manager. We are going to install Cloudera Manager on one machine and get it to install CDH on all of the machines.

! - In order to install Cloudera software we have three options:

  1. have all the machines fetch directly from Cloudera repositories (which can be heavy on your internet connection).
  2. have the machines fetch from a local repository which we pre-populate with files in one of two ways:
    1. Use packages - and create your own local yum repository to reduce internet traffic, or
    2. Use parcels - and let Cloudera Manager act as a repository.
  3. Use a caching web proxy such as Squid

URL: http://www.squid-cache.org/

 

! - Using parcels is the option we have chosen. It is simplest and requires less network traffic. If, on the other hand, you want to stick with packages then you could create your own yum repository, but I think you should only do that if you have some overriding reason - such as it being company best-practice.

?- If you want us to write an article on this approach, let us know in the comments.

Option 3, using a caching web proxy instead of a local repository, is another possibility. We won't describe how to set one up here, but there is plenty of material online. Squid seems to be a popular choice for that.

 

Prior to Installation

Prior to installation you need to prove that the machines are set up and can talk to each other.

The following is roughly going over the points raised in Configuration Requirements for Cloudera Manager, Cloudera Navigator, and CDH.

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/installation_reqts.html

 

OS

We are assuming that you have already installed a recent CentOS onto all of the machines in the cluster. If not, then go and read one of many articles on how to do that. We use the minimal install of CentOS 7.3 ourselves, and subsequently install any other tools we need on top of that, but you have free choice in this.

Note - We believe that almost all of our instructions should work the same on a Red Hat Enterprise Linux (RHEL) system.

 

FQDN

Each machine should have a Fully Qualified Domain Name and be found in DNS. You can use just IP addresses, but this is not recommended and may fail in unpredictable ways. I understand that reverse DNS needs to be set up too - so that Hadoop can find out the hostname from the IP address, not just the other way around.
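
A minimal sketch of how you might check forward and reverse lookups from the command line. It uses getent, which goes through the same resolver configuration as other programs on the box. The function names is_fqdn and check_dns are our own, and the "contains a dot" test is only a rough heuristic.

```shell
# Succeeds if the name looks fully qualified (contains at least one dot).
is_fqdn() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Forward-resolve a host name, then reverse-resolve the address it maps to.
check_dns() {
    host_name="$1"
    ip=$(getent hosts "$host_name" | awk '{ print $1; exit }')
    if [ -z "$ip" ]; then
        echo "$host_name: no forward lookup"
        return 1
    fi
    reverse=$(getent hosts "$ip" | awk '{ print $2; exit }')
    echo "$host_name -> $ip -> ${reverse:-no reverse lookup}"
}

# Usage, run on each node:
# is_fqdn "$(hostname -f)" && check_dns "$(hostname -f)"
```

Ideally the name printed at the end of the line matches the name you started with, on every machine in the cluster.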

 

Passwordless Root login

Cloudera Manager needs to run stuff on all of the machines in the cluster. My preferred way of doing this is by setting up passwordless root access so that root can run commands on any box in the cluster. 

There are many web pages talking about how you create /root/.ssh/id_rsa and /root/.ssh/id_rsa.pub and make sure the contents of the latter, id_rsa.pub, are added to the file /root/.ssh/authorized_keys on every machine in the cluster. There are other options but this is the one which works most reliably for me.
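
As a sketch only, the key setup might look like the following. The host names fleet01 to fleet05 come from this article, but the example.com domain is a placeholder and the function names are hypothetical. ssh-copy-id appends the public key to authorized_keys on the target machine and will ask for that machine's root password once.

```shell
# Assumption: five nodes named fleet01..fleet05 in a placeholder domain.
fleet_hosts() {
    for n in 01 02 03 04 05; do
        echo "fleet$n.example.com"
    done
}

# Create a key pair for root (if missing) and append the public key to
# /root/.ssh/authorized_keys on every node.
distribute_root_key() {
    [ -f /root/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f /root/.ssh/id_rsa
    for h in $(fleet_hosts); do
        ssh-copy-id -i /root/.ssh/id_rsa.pub "root@$h"
    done
}

# Usage, as root on the machine that will run Cloudera Manager:
# distribute_root_key
# ssh root@fleet02.example.com hostname   # should not prompt for a password
```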

 

SELinux - Security Enhanced Linux

You should disable SELinux, or at least make it permissive, during installation. If you want to, you can reconfigure SELinux afterwards. The following is how you might do that on a CentOS box.

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/install_cdh_disable_selinux.html

 

Note - In Cloudera's documentation the following is noted ONLY under 'Installing and Deploying CDH Using the Command Line' (Install Path A), as far as we can see, but is in fact necessary regardless of your installation path (it can be tricky to diagnose errors that arise otherwise).

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_command_line.html

 

Check the SELinux state:

getenforce

 

The result you want is permissive or disabled.

 

If you get the result enforcing, open /etc/selinux/config in an editor, as follows:

vi /etc/selinux/config

 

Note - If you can't find /etc/selinux/config, the file may be at /etc/sysconfig/selinux (on CentOS this is normally a symbolic link to the same file). If so, type vi /etc/sysconfig/selinux instead.

 

Change the line SELINUX=enforcing to SELINUX=permissive.

 

 

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=permissive
# SELINUXTYPE= can take one of three two values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted

 

Save and close the file.

 

Restart your system, or run setenforce 0 (that's a zero) from the command line to switch the running system to permissive mode immediately, and continue with pre-install configuration.
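
If you have several machines to change, the same edit can be scripted rather than made by hand in vi. This is a minimal sketch, not Cloudera's method; the function name set_selinux_permissive is our own. It rewrites the file contents in place, so a symbolic link pointing at the config file is left as a link.

```shell
# Switch SELINUX=enforcing to SELINUX=permissive in an SELinux config file.
# Defaults to /etc/selinux/config; pass another path as the first argument.
set_selinux_permissive() {
    cfg="${1:-/etc/selinux/config}"
    tmp=$(mktemp)
    # Write the edited copy to a temp file, then copy the content back.
    sed 's/^SELINUX=enforcing$/SELINUX=permissive/' "$cfg" > "$tmp" &&
        cat "$tmp" > "$cfg"
    rm -f "$tmp"
}

# Usage, as root:
# set_selinux_permissive
# setenforce 0      # apply to the running system as well
# getenforce
```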

 

Disabling the Firewall

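You may find that there are other firewall programs on your particular Linux. They need to be switched off or heavily reconfigured. Here is how we might switch off iptables on CentOS. Note that CentOS 7 uses firewalld by default, so the iptables service commands below only apply if you have installed the iptables service (the iptables-services package).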

Save the existing iptables rule set:

iptables-save > /root/firewall.rules

 

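Then stop iptables from starting at boot (rules that are already loaded stay active until the service is stopped or the machine is restarted):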

chkconfig iptables off
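
On a default CentOS 7 install the firewall is firewalld rather than the iptables service, so the equivalent might look like the sketch below. The run wrapper and the DRY_RUN variable are our own convention for previewing the commands; they are not part of systemctl.

```shell
# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop firewalld now and keep it from starting at boot.
disable_firewalld() {
    run systemctl stop firewalld
    run systemctl disable firewalld
}

# Usage, as root:
# DRY_RUN=1 disable_firewalld   # preview the commands
# disable_firewalld             # actually run them
```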

 

ntp - Network Time Protocol

ntp is necessary for keeping the machine clocks in your cluster in sync. We followed RHEL7: How to set up the NTP service to do this - but you can do it however you like...

...including Cloudera's way: Enabling NTP.

 

URL: https://www.certdepot.net/rhel7-set-ntp-service/

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/install_cdh_enable_ntp.html

 

timedatectl
# Confirm that the same timezone is set for each box.
# If it isn't, set it with timedatectl, e.g.: timedatectl set-timezone America/Los_Angeles

# I am using ntp but there are other options
yum install -y ntp
systemctl enable ntpd
systemctl start ntpd
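
Once ntpd has been running for a few minutes you may want to confirm that each clock is actually in sync. A small sketch, assuming the wording of the timedatectl status line (CentOS 7 prints "NTP synchronized: yes", newer systemd releases print "System clock synchronized: yes"); the function name clock_synced is ours.

```shell
# Read `timedatectl` output on stdin and succeed if the clock is reported
# as synchronised.
clock_synced() {
    grep -Eq '(NTP|System clock) synchronized: yes'
}

# Usage on each node:
# timedatectl | clock_synced && echo "clock in sync" || echo "clock NOT in sync"
# ntpq -p        # lists the peers ntpd is talking to
```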

 

Optimizing Performance

There are a number of settings that can be tweaked in CentOS to improve performance, detailed in Cloudera's Optimizing Performance in CDH. These are presented below, with the separate URLs for each section.

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html 

 

Switch off Swap - sort of

On many modern systems swap can interfere with efficient memory use. It is common to switch it off on servers, and especially on Hadoop servers. However, that may not be exactly what you want - e.g. with MySQL. Cloudera recommends Setting the vm.swappiness Linux Kernel Parameter to a very low setting, ideally between 1 and 10 (out of 100).

!- I suggest that you read up on the subject and decide on what your swap strategy should be.

Here's some additional detail on possible errors for some CentOS installs: OOM (Out of Memory) relation to vm.swappiness=0 in new kernel 

 

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html#cdh_performance__section_xpq_sdf_jq

URL: https://www.percona.com/blog/2014/04/28/oom-relation-vm-swappiness0-new-kernel/

 

To view your current setting for vm.swappiness run:

cat /proc/sys/vm/swappiness

To set vm.swappiness to 1, run:

sudo sysctl -w vm.swappiness=1
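
Note that sysctl -w only changes the running kernel, so the value is lost at the next reboot. One way to make it stick is to record it in /etc/sysctl.conf; the sketch below does that, replacing any existing vm.swappiness entry. The function name persist_swappiness is hypothetical.

```shell
# Set vm.swappiness in a sysctl configuration file, replacing any existing
# entry. Defaults: value 1 and /etc/sysctl.conf.
persist_swappiness() {
    value="${1:-1}"
    conf="${2:-/etc/sysctl.conf}"
    tmp=$(mktemp)
    # Keep every line except existing vm.swappiness settings.
    # grep exits non-zero when nothing is left to print, which is fine here.
    grep -v '^[[:space:]]*vm\.swappiness[[:space:]]*=' "$conf" > "$tmp" || true
    echo "vm.swappiness = $value" >> "$tmp"
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Usage, as root:
# persist_swappiness 1
# sysctl -p          # reload /etc/sysctl.conf
```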

 

Transparent Huge Pages

Switching off Transparent Huge Pages (THP) is a performance optimisation. THP is great for many uses - but not Hadoop. As such, Cloudera recommend Disabling Transparent Huge Pages (THP).

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html#cdh_performance__section_hw3_sdf_jq

For us that was:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
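
To confirm the change took effect, you can read the same files back: the active value is the one shown in square brackets. A small sketch (the function name thp_state is ours):

```shell
# Print the active setting from a THP control file, i.e. the value shown in
# square brackets, for example "never" from "always madvise [never]".
thp_state() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Usage:
# thp_state
# thp_state /sys/kernel/mm/transparent_hugepage/defrag
```

Bear in mind that these echo commands do not survive a reboot. As far as we know, the usual approach is to add the same two lines to an init script such as /etc/rc.d/rc.local (which must be executable on CentOS 7), but check the Cloudera page above for your version.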

 

Disable Tuned Service

Cloudera also recommend that you Disable the Tuned Service, but without any details as to why, or what options there are, which seems to imply it's a necessity.

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html#cdh_performance__disable-tuned

 

For us the process was:

  1. Ensure that the tuned service is started:
    systemctl start tuned
  2. Turn the tuned service off:
    tuned-adm off
  3. Ensure that there are no active profiles:
    tuned-adm list
    The output should contain the following line:
    No current active profile
  4. Shutdown and disable the tuned service:
    systemctl stop tuned
    systemctl disable tuned

     

Note - The Cloudera documentation, in point one, says systemctl start - it should say systemctl start tuned, as above.

 

Improving Performance and Best Practice

The Cloudera documentation goes into detail about further optimizations such as Improving Performance in Shuffle Handler and IFile Reader, Best Practices for MapReduce Configuration, and Tips and Best Practices for Jobs. We have not gone into these here, as they are system tweaks that don't ultimately affect the install and can be made post-install. However, you may wish to read through them now, or make a note for later.

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html#cdh_performance__section_nt5_sdf_jq

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html#cdh_performance__best-mapreduce

URL: https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_admin_performance.html#cdh_performance__section_m4h_tdf_jq

 

Installation

Next up is actually Doing the Install: Cloudera CDH5.12 on CentOS7.3.

 

URL: http://www.alephant.co.uk/Installing_CDH_on_Hadoop-3-Install
