
Set up a remote Hadoop cluster to work with Watson Studio Local

You can set up a remote Hadoop cluster to allow Watson Studio Local users to securely access the data that resides on a Hadoop cluster and submit jobs to use Spark and compute resources on the Hadoop cluster.

Watson Studio Local interacts with a Hadoop cluster through the following four services:
Service          Purpose
WebHDFS          Browse and preview HDFS data.
WebHCAT          Browse and preview Hive tables.
Livy for Spark   Submit jobs to Spark on the Hadoop cluster.
Livy for Spark2  Submit jobs to Spark2 on the Hadoop cluster.
Setting up a Hadoop cluster for Watson Studio Local entails installing and configuring the four services. Additionally, for kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from Watson Studio Local users.
Important: The setup tasks must be performed by a Hadoop administrator.
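After the four services are installed, a quick smoke test is to probe their HTTP endpoints from the edge node. The host name below is a placeholder, and the ports shown (50070 for WebHDFS, 50111 for WebHCAT, 8998/8999 for the Livy services) are common defaults, not values mandated by the product; substitute your cluster's actual host and ports.

```shell
# Prints the HTTP status code for a URL ("000" if the service is unreachable).
hadoop_probe() {
  curl -k -s -o /dev/null -w '%{http_code}\n' "$1"
}

# Placeholder host and default ports -- substitute your cluster's values:
#   hadoop_probe "http://edge-node.example.com:50070/webhdfs/v1/?op=LISTSTATUS"  # WebHDFS
#   hadoop_probe "http://edge-node.example.com:50111/templeton/v1/status"        # WebHCAT
#   hadoop_probe "http://edge-node.example.com:8998/sessions"                    # Livy for Spark
#   hadoop_probe "http://edge-node.example.com:8999/sessions"                    # Livy for Spark2
```

A 200 response from each endpoint indicates the service is up and answering requests.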
Edge node hardware requirements
  • 8 GB memory
  • 2 CPU cores
  • 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories: /var/log/dsxhi, /var/log/livy, and /var/log/livy2 to store the logs and /var/run/dsxhi, /var/run/livy and /var/run/livy2 to store the process IDs. These locations are not configurable.
  • 10 Gb network interface card recommended for multi-tenant environments (a 1 Gb network interface card is sufficient if WebHDFS will not be heavily used)
Edge node requirements
  • Have Python 2.7 installed.
  • CDH only: Have Java Development Kit Version 1.8 installed.
  • Have curl 7.19.7-53 or later installed to allow secure communication between the Hadoop registration service and Watson Studio Local.
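The curl minimum can be checked mechanically. A small sketch: the helper below compares dotted version strings with sort -V, dropping the package-release suffix (the -53 in 7.19.7-53) before comparing.

```shell
# Returns success (0) if version $1 >= version $2, comparing dotted numerics.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Strip everything after the version number from curl's banner, then compare.
curl_ver=$(curl --version | head -n1 | awk '{print $2}' | cut -d- -f1)
if version_ge "$curl_ver" "7.19.7"; then
  echo "curl $curl_ver is new enough"
else
  echo "curl $curl_ver is too old; 7.19.7 or later is required"
fi
```

The same helper works for checking the Python 2.7 and JDK 1.8 requirements against `python --version` and `java -version` output.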
Ports
  • Have an external port for the Hadoop registration service.
  • Have an internal port available for the Hadoop registration REST service.
  • Depending on the services that need to be exposed by Hadoop registration, have internal ports available for Livy for Spark and Livy for Spark2.
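Before picking ports, it is worth confirming that the candidates are not already in use. A minimal sketch using bash's /dev/tcp pseudo-device (a bash-specific feature; `ss -ltn` or `netstat -ltn` give the same answer); the port numbers in the loop are illustrative, not values the product requires:

```shell
# Succeeds if nothing is listening on the given local TCP port.
# Bash-specific: a refused /dev/tcp connection means the port is free.
port_free() {
  ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

for port in 8443 8080 8998 8999; do   # example candidates only
  if port_free "$port"; then
    echo "port $port is free"
  else
    echo "port $port is already in use"
  fi
done
```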

Security and authentication for Hadoop

[Diagram: Hadoop integration]

The Hadoop Integration service is a secure service installed on the edge node of the Hadoop cluster. Access to the service is restricted to an explicit list of Watson Studio Local clusters via a secure URL. Every request from Watson Studio Local includes the JWT token of the signed-in Watson Studio Local user and is authenticated against the secure URL. The Hadoop Integration service extracts the username from the JWT token and propagates it for all data access and job submission on the Hadoop cluster.
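To illustrate the username-propagation step: a JWT is three base64url-encoded segments (header.payload.signature), and the payload is a JSON object whose claims carry the user's identity. The sketch below pulls a username out of an assumed "sub" claim; the claim name is an assumption, and unlike the real service, this sketch does not validate the token's signature against the registered Watson Studio Local cluster.

```shell
# Extract the "sub" claim from a JWT without verifying the signature.
# (Illustration only -- the claim name and skipped validation are assumptions.)
jwt_username() {
  # Take the middle (payload) segment and undo the base64url substitutions.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore base64 padding to a multiple of 4 characters.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d |
    sed -n 's/.*"sub"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

For a token whose payload is `{"sub":"jdoe"}`, `jwt_username "$token"` prints `jdoe`.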

User requirements
  • Ensure every user connecting from Watson Studio Local to Hadoop through the Hadoop Integration service is a valid user on the Hadoop cluster.
  • Identify a service user that can run the Hadoop Integration service:
    • This user should be a valid Linux user on the node where the Hadoop Integration service is installed.
    • This user should have a home directory created in HDFS. The directory should have both owner and group set to the service user.
    • The service user should have the necessary Hadoop proxyuser privileges in HDFS.
    • The service user should have the necessary Hadoop proxyuser privileges in WebHCAT.
    • If you're using an existing Livy service running on the Hadoop cluster, the service user should have the necessary superuser privileges in the Livy services to access data and submit asynchronous jobs as Watson Studio Local users.
    • HDP only: If Hadoop KMS or Ranger KMS is enabled, the service user should have the necessary proxyuser privileges in kms-site.xml.
  • For a cluster that is kerberized:
    • Have the keytab for the service user. This eliminates the need for every Watson Studio Local user to have a valid keytab.
    • Have a SPNEGO keytab for the node where Hadoop Integration service is installed.
  • For a cluster that is not kerberized: The yarn user should have write access to the directories accessed by the job.
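The HDFS proxyuser privileges mentioned above are granted through the hadoop.proxyuser.* properties in core-site.xml (edited through Ambari on HDP or Cloudera Manager on CDH). A minimal sketch, assuming the service user is named dsxhi and the edge node is edge-node.example.com (both placeholders); restricting hosts and groups more tightly than * is the usual hardening advice:

```xml
<!-- core-site.xml: let the service user impersonate Watson Studio Local users -->
<property>
  <name>hadoop.proxyuser.dsxhi.hosts</name>
  <value>edge-node.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.groups</name>
  <value>*</value>
</property>
```

WebHCAT has the analogous webhcat.proxyuser.dsxhi.hosts and webhcat.proxyuser.dsxhi.groups properties in webhcat-site.xml.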

Requirements for a service user installing the Hadoop registration service

If you plan to install the Hadoop registration service as a service user rather than as root, you must first use the visudo command to add the following access information to /etc/sudoers:

## DSXHI - General Installation (replace <service_user>)
<service_user> ALL=(root) NOPASSWD: /usr/bin/yum install dsxhi*, /usr/bin/yum install wshi*, /usr/bin/yum erase dsxhi*, /usr/bin/mkdir -p /etc/dsxhi, /usr/bin/mkdir -p /var/log/dsxhi, /usr/bin/mkdir -p /var/run/dsxhi, /usr/bin/mkdir -p /var/log/livy, /usr/bin/mkdir -p /var/run/livy, /usr/bin/mkdir -p /var/log/livy2, /usr/bin/mkdir -p /var/run/livy2, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/chown * /etc/dsxhi/conf, /usr/bin/chown * /var/log/dsxhi, /usr/bin/chown * /var/run/dsxhi, /usr/bin/chown * /var/log/livy, /usr/bin/chown * /var/run/livy, /usr/bin/chown * /var/log/livy2, /usr/bin/chown * /var/run/livy2, /usr/bin/chmod 400 -R /opt/ibm/dsxhi/security/*, /usr/bin/chmod 755 /var/log/dsxhi, /usr/bin/chmod 755 /var/run/dsxhi, /usr/bin/ln -sf /opt/ibm/dsxhi/gateway/logs /var/log/dsxhi/gateway, /usr/bin/ln -sf /opt/ibm/dsxhi/conf /etc/dsxhi/conf, /usr/bin/ln -sf /var/log/livy /var/log/dsxhi/livy, /usr/bin/ln -sf /var/log/livy2 /var/log/dsxhi/livy2

## DSXHI - Service User Specific Commands (replace <service_user>)
<service_user> ALL=(root) NOPASSWD: /usr/bin/su <service_user> -c hdfs dfs -test -e /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -test -d /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -mkdir /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -chmod 755 /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -chmod 644 /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -put -f /opt/ibm/dsxhi/* /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -rm -r /user/<service_user>/*, /usr/bin/su <service_user> -c sh /opt/ibm/dsxhi/bin/util/gateway_config.sh *

## DSXHI - Security (only needed if security is enabled; replace <service_user>, and replace <service_keytab> with the path to the service user keytab)
<service_user> ALL=(root) NOPASSWD: /usr/bin/chown * /opt/ibm/dsxhi/security/*, /usr/bin/su <service_user> -c kinit -kt /opt/ibm/dsxhi/security/* *, /usr/bin/cp /etc/security/keytabs/spnego.service.keytab /opt/ibm/dsxhi/security/*, /usr/bin/su <service_user> -c /usr/bin/kdestroy, /usr/bin/cp <service_keytab> /opt/ibm/dsxhi/security/*

## DSXHI - HDP (only needed for HDP)
<service_user> ALL=(root) NOPASSWD: /usr/sbin/ambari-agent --version, /usr/jdk64/jdk1.8.0_112/bin/keytool -delete -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *, /usr/jdk64/jdk1.8.0_112/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *

## DSXHI - CDH (only needed for CDH)
<service_user> ALL=(root) NOPASSWD: /usr/java/jdk1.7.0_67-cloudera/bin/keytool -delete -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *

Example of a working configuration, populated for a service user named dsxhi with security enabled:

## DSXHI - General Installation
dsxhi ALL=(root) NOPASSWD: /usr/bin/yum install dsxhi*, /usr/bin/yum erase dsxhi*, /usr/bin/mkdir -p /etc/dsxhi, /usr/bin/mkdir -p /var/log/dsxhi, /usr/bin/mkdir -p /var/run/dsxhi, /usr/bin/mkdir -p /var/log/livy, /usr/bin/mkdir -p /var/run/livy, /usr/bin/mkdir -p /var/log/livy2, /usr/bin/mkdir -p /var/run/livy2, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/chown * /etc/dsxhi/conf, /usr/bin/chown * /var/log/dsxhi, /usr/bin/chown * /var/run/dsxhi, /usr/bin/chown * /var/log/livy, /usr/bin/chown * /var/run/livy, /usr/bin/chown * /var/log/livy2, /usr/bin/chown * /var/run/livy2, /usr/bin/chmod 400 -R /opt/ibm/dsxhi/security/*, /usr/bin/chmod 755 /var/log/dsxhi, /usr/bin/chmod 755 /var/run/dsxhi, /usr/bin/ln -sf /opt/ibm/dsxhi/gateway/logs /var/log/dsxhi/gateway, /usr/bin/ln -sf /opt/ibm/dsxhi/conf /etc/dsxhi/conf, /usr/bin/ln -sf /var/log/livy /var/log/dsxhi/livy, /usr/bin/ln -sf /var/log/livy2 /var/log/dsxhi/livy2

## DSXHI - Service User Specific Commands
dsxhi ALL=(root) NOPASSWD:  /usr/bin/su dsxhi -c hdfs dfs -test -e /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -test -d /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -mkdir /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -chmod 755 /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -chmod 644 /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -put -f /opt/ibm/dsxhi/* /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -rm -r /user/dsxhi/*,  /usr/bin/su dsxhi -c sh /opt/ibm/dsxhi/bin/util/gateway_config.sh *

## DSXHI - Security
dsxhi ALL=(root) NOPASSWD: /usr/bin/chown * /opt/ibm/dsxhi/security/*, /usr/bin/su dsxhi -c kinit -kt /opt/ibm/dsxhi/security/* *, /usr/bin/cp /etc/security/keytabs/spnego.service.keytab /opt/ibm/dsxhi/security/*, /usr/bin/su dsxhi -c /usr/bin/kdestroy, /usr/bin/cp /etc/security/svckeytabs/dsxhi.keytab /opt/ibm/dsxhi/security/*

## DSXHI - HDP
dsxhi ALL=(root) NOPASSWD: /usr/sbin/ambari-agent --version, /usr/jdk64/jdk1.8.0_112/bin/keytool -delete -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *, /usr/jdk64/jdk1.8.0_112/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *

## DSXHI - CDH
dsxhi ALL=(root) NOPASSWD: /usr/java/jdk1.7.0_67-cloudera/bin/keytool -delete -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *