Table of contents

Set up Cloudera Distribution for Hadoop (CDH) to work with Watson Studio Local

Watson Studio Local allows users to securely access the data residing on a CDH cluster and submit jobs to use the compute resources on the CDH cluster. Watson Studio Local interacts with a CDH cluster through four services: WebHDFS, WebHCAT, Livy for Spark and Livy for Spark2. WebHDFS is used to browse and preview HDFS data. WebHCAT is used to browse and preview Hive tables. Livy for Spark and Livy for Spark2 are used to submit jobs to Spark or Spark2 engines on the Hadoop cluster.
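
For orientation, the sketch below shows the kind of REST calls each of the four services answers. The hosts are placeholders and the ports are common defaults (the Livy for Spark2 port in particular is an assumption), so adjust them to your cluster.

    # Illustrative only: hosts are placeholders, ports are common defaults.
    curl -i "http://namenode-host:50070/webhdfs/v1/tmp?op=LISTSTATUS"              # WebHDFS: list an HDFS directory
    curl -i "http://webhcat-host:50111/templeton/v1/ddl/database?user.name=hive"   # WebHCAT: list Hive databases
    curl -i "http://livy-host:8998/sessions"                                       # Livy for Spark: list sessions
    curl -i "http://livy2-host:8999/sessions"                                      # Livy for Spark2: list sessions (port assumed)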

Setting up a CDH cluster for Watson Studio Local entails installing and configuring the four services. Additionally, for kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from Watson Studio Local users. The following tasks must be performed by a Hadoop admin.

Versions supported

CDH versions 5.12, 5.13, and 5.14.

Platforms supported

Hadoop registration is supported on all platforms supported by the CDH versions.

Available options for set up

Version 1.2 introduces a Hadoop registration service that eases the setup of a CDH cluster for Watson Studio Local. Using Hadoop registration is the recommended approach, and it adds the ability to schedule jobs as YARN applications. The approach without Hadoop registration continues to be supported.

Option 1: Set up a CDH cluster with Hadoop registration

Figure: Hadoop registration integration diagram

The Hadoop registration service should be installed on an edge node of the CDH cluster. The gateway component authenticates all incoming requests and forwards them to the Hadoop services. In a kerberized cluster, the keytab of the Hadoop registration service user and the SPNEGO keytab for the edge node are used to acquire the Kerberos tickets needed to communicate with the Hadoop services. All requests to the Hadoop services are submitted as the Watson Studio Local user.

Edge node hardware requirements
  • 8 GB memory
  • 2 CPU cores
  • 100 GB disk, mounted and available on /var in the local Linux file system
  • 10 Gb network interface card recommended for multi-tenant environments (1 Gb network interface card if WebHDFS will not be heavily utilized)
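
A quick way to spot-check these numbers on the edge node (illustrative commands only):

    free -g | awk '/^Mem:/{print $2 " GB RAM"}'    # expect 8 or more
    nproc                                          # expect 2 or more cores
    df -h /var                                     # expect 100 GB or more available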
Create an edge node

The Hadoop registration service can be installed on a shared edge node if the resources listed above are exclusively available for Hadoop registration. When the edge node is successfully created, it should have the following components (a quick way to spot-check them is sketched after the list):

  • The Hadoop Gateway Role installed.
  • The Spark Gateway Role installed if the CDH cluster has a Spark service.
  • The Spark2 client installed if the CDH cluster has a Spark2 service.
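
A minimal sketch of those spot checks from a shell on the edge node; the spark2-shell name assumes the usual CDH Spark2 client naming.

    hdfs version              # Hadoop gateway role is on the PATH
    spark-shell --version     # Spark gateway role (if the cluster has Spark)
    spark2-shell --version    # Spark2 client (if the cluster has Spark2)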
Additional prerequisites

In addition, the following requirements should be met on the edge node:

  • Have Python 2.7 installed.
  • Have curl 7.19.7-53 or later to allow secure communication between the Hadoop registration service and Watson Studio Local.
  • Have a service user that can run the Hadoop registration service. This user should be a valid Linux user with a home directory created in HDFS.
  • The service user should have the necessary Hadoop proxyuser privileges in the HDFS and WebHCAT services so that it can access data and submit asynchronous jobs as Watson Studio Local users (see the sketch after this list).
  • For a kerberized cluster: Have the keytab for the service user. This eliminates the need for every Watson Studio Local user to have a valid keytab.
  • Have an available port for the Hadoop registration service. The port for the Hadoop registration service should be exposed for access from the Watson Studio Local clusters that need to connect to the CDH cluster.
  • Have an available port for the Hadoop registration REST service. This port does not need to be exposed for external access.
  • Depending on the services that need to be exposed by Hadoop registration, have an available port each for Livy for Spark and Livy for Spark2. These ports do not need to be exposed for external access.
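
As a rough illustration of the proxyuser requirement above, the classic core-site.xml form is shown below, assuming a service user named dsxhi and an edge node edge1.example.com (both placeholders). On CDH these values are typically set through the Cloudera Manager configuration for the HDFS service, and analogous webhcat.proxyuser.* settings apply to WebHCAT.

    <!-- Illustrative proxyuser entries; "dsxhi" and the host are placeholders. -->
    <property>
      <name>hadoop.proxyuser.dsxhi.hosts</name>
      <value>edge1.example.com</value>
    </property>
    <property>
      <name>hadoop.proxyuser.dsxhi.groups</name>
      <value>*</value>
    </property>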
Install the Hadoop registration service

To install and configure the Hadoop registration service on the edge node, the Hadoop admin must complete the following tasks:

  1. Download the Hadoop registration RPM file (dsxhi_<platform>.rpm) to the edge node, for example, dsxhi_x86_64.rpm.
  2. Run the RPM installer. The RPM is installed in /opt/ibm/dsxhi.
  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH as a reference. Edit the values in the conf file. For guidance, see the inline documentation in the dsxhi_install.conf.template.CDH files.
  4. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the Hadoop registration service. The script prompts for inputs on the following options (alternatively, you can specify the options as flags; a combined invocation is sketched after this list):
    • Accept the license terms (Hadoop registration uses the same license as Watson Studio Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • If the Cloudera Manager URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
    • The master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • If the default password for Java cacerts truststore has been changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
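
For illustration, a non-interactive installation might look like the following. Every value is a placeholder, and only the two conf properties named above are shown, with assumed value formats; see the inline documentation in dsxhi_install.conf.template.CDH for the full list.

    # Illustrative only; all values are placeholders.
    # In /opt/ibm/dsxhi/conf/dsxhi_install.conf (value formats assumed):
    #   dsxhi_license_acceptance=yes
    #   cluster_manager_url=https://cm-host.example.com:7183
    cd /opt/ibm/dsxhi/bin
    ./install.py --password "<cm-admin-password>" \
                 --dsxhi_gateway_master_password "<master-secret>" \
                 --dsxhi_java_cacerts_password "<cacerts-password>"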

The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.

After a successful installation, the necessary components (Hadoop registration gateway service and Hadoop registration REST service) and optional components (Livy for Spark and Livy for Spark2) will be started. The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2. The component PIDs are stored in /var/run/dsxhi, /var/run/livy, /var/run/livy2, and /opt/ibm/dsxhi/gateway/logs/.
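
One way to confirm the components came up, assuming the log and PID locations above (exact file names may differ):

    ls /var/run/dsxhi /var/run/livy /var/run/livy2    # PID files for each component
    tail -n 20 /var/log/dsxhi/*.log                   # recent gateway/REST service activity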

To add a CA certificate to the Hadoop registration REST service, go to the /opt/ibm/dsxhi/bin/util directory on the edge node and run the add_cert.sh script with the server address, for example, bash add_cert.sh https://master-1.ibm.com:443.

Manage the Hadoop registration service
Periodically, the Hadoop admin must manage the Hadoop registration service. These tasks include:
Check status of the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./status.py to check the status of the Hadoop registration gateway, Hadoop registration REST server, Livy for Spark, and Livy for Spark2 services.
Start the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./start.py to start the Hadoop registration gateway, Hadoop registration REST server, Livy for Spark, and Livy for Spark2 services.
Stop the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./stop.py to stop the Hadoop registration gateway, Hadoop registration REST server, Livy for Spark, and Livy for Spark2 services.
Add certificates for SSL enabled services
If the WebHDFS service is SSL enabled, the certificates of the NodeManagers and DataNodes should be added to the Hadoop registration gateway truststore. In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the NodeManagers and DataNodes to add their certificates to the Hadoop registration gateway truststore (a loop is sketched below).
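
A minimal sketch of that loop, assuming placeholder host names and the common HTTPS ports for DataNodes (50475) and NodeManagers (8044):

    cd /opt/ibm/dsxhi/bin/util
    # Replace the list with your actual DataNode and NodeManager addresses.
    for host in datanode1.example.com:50475 nodemanager1.example.com:8044; do
      ./add_cert.sh "https://${host}"
    done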
Manage Watson Studio Local for Hadoop registration

To maintain control over access to a Hadoop registration service, a Hadoop admin needs to maintain a list of known Watson Studio Local clusters that can access the service. A Watson Studio Local cluster is known by its URL, which should be passed in when adding to or deleting from the known list. A Hadoop admin can add (or delete) multiple Watson Studio Local clusters in one call by passing in a comma-separated list of cluster URLs. Irrespective of the order in which the add and delete arguments are specified, the deletes are applied first and then the adds, as illustrated below.
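
For example, in a single call with placeholder URLs, the old cluster below is removed before the new one is added, even though --add appears first:

    cd /opt/ibm/dsxhi/bin
    ./manage_known_dsx.py --add "https://wsl-new.example.com" \
                          --delete "https://wsl-old.example.com"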

Add a Watson Studio Local cluster to the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a Watson Studio Local cluster is added to the known list, the necessary authentication will be set up and the Watson Studio Local admin can be given a URL to securely connect to the Hadoop registration service.
Delete a Watson Studio Local cluster from the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
Hadoop registration URL for secure access from Watson Studio Local

In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to list a table of all known Watson Studio Local clusters and the associated URL that can be used to securely connect from a Watson Studio Local cluster to a Hadoop registration service. The Watson Studio Local admin can then register the Hadoop registration cluster.

Uninstall the Hadoop registration service

To uninstall the Hadoop registration service and remove the files from /opt/ibm/dsxhi, a Hadoop admin can run the ./uninstall.py script in /opt/ibm/dsxhi/bin. The uninstallation logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.

Option 2: Set up a CDH cluster without Hadoop registration

A CDH cluster can be set up for the Watson Studio Local cluster without using the Hadoop registration service. If your CDH cluster does not use Kerberos security, then ensure Watson Studio Local can access the host and port of the services. No additional configuration is needed. If the CDH cluster uses Kerberos security, follow the steps outlined below.

CDH requirements (see CDH documentation for guidance):

  • Knox v0.13 and Livy v0.3.0 installed.
  • Knox service is installed with SSL enabled.
  • The Livy service for Spark must be set up to be accessible through Knox. You must add the service.xml file to <Knox installation directory>/data/services/livy/0.3.0/. See Adding a service to Apache Knox for details.
  • The Livy service for Spark should have the rewrite rule definition to support impersonation. You can verify this by checking that the rewrite.xml file exists in <Knox installation directory>/data/services/livy/0.3.0/. See Adding Livy Server as service to Apache Knox for details.
  • In the <Livy installation directory>/conf/livy.conf file, set the following property: livy.superusers=knox.

To configure the CDH cluster, you must create a new Knox topology named dsx that is based on JWT authentication and has the service entries for Livy for Spark, WebHDFS, and WebHCAT. Complete the following steps:

  1. Go to https://9.87.654.320/auth/jwtcert (where https://9.87.654.320 represents the Watson Studio Local URL) and save the public SSL certificate as jwt.cert. Alternatively, run a curl command to download the SSL certificate from Watson Studio Local:
    curl -k https://9.87.654.320/auth/jwtcert -o jwt.cert
  2. In the <Knox installation directory>/conf/topologies directory of your Knox server, create a new topology for Watson Studio Local named dsx.xml, and paste the key from the SSL certificate (between BEGIN CERTIFICATE and END CERTIFICATE) into the <value> tag. Also, ensure you have service entries for Livy for Spark, WebHDFS, and WebHCAT. Example:
    <topology>
      <gateway>
        <provider>
          <role>federation</role>
          <name>JWTProvider</name>
          <enabled>true</enabled>
          <param>
            <name>knox.token.verification.pem</name>
            <value>MIIDb...Zpuw</value>
          </param>
        </provider>
        <provider>
          <role>identity-assertion</role>
          <name>Default</name>
          <enabled>true</enabled>
        </provider>
        <provider>
          <role>authorization</role>
          <name>AclsAuthz</name>
          <enabled>true</enabled>
        </provider>
      </gateway>
      <service>
        <role>LIVYSERVER</role>
        <url>http://9.87.654.323:8998</url>
      </service>
      <service>
        <role>WEBHDFS</role>
        <url>http://9.87.654.321:50070/webhdfs</url>
      </service>
      <service>
        <role>WEBHCAT</role>
        <url>http://9.87.543.324:50111/templeton</url>
      </service>
    </topology>
  3. Touch the dsx.xml file to update the timestamp on it.
  4. Restart the Knox server to detect the new topology. The URLs for the four services will be:
    • https://knoxhost:8443/gateway/dsx/webhdfs/v1
    • https://knoxhost:8443/gateway/dsx/templeton/v1
    • https://knoxhost:8443/gateway/dsx/livy/v1
    • https://knoxhost:8443/gateway/dsx/livy2/v1
    The livy2 URL applies only if the topology also contains a corresponding service entry for Livy for Spark2. You can smoke-test these endpoints as sketched after these steps.
  5. Configure Watson Studio Local to work with the CDH cluster. See the Watson Studio Local setup documentation for details.
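
Once Knox is back up, a smoke test of the dsx topology might look like the following, assuming a valid Watson Studio Local JWT is available in the WSL_TOKEN environment variable (knoxhost is a placeholder):

    # Illustrative only: list the HDFS root through the dsx topology.
    curl -k -H "Authorization: Bearer $WSL_TOKEN" \
      "https://knoxhost:8443/gateway/dsx/webhdfs/v1/?op=LISTSTATUS"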