Set up Cloudera Distribution for Hadoop (CDH) to work with Watson Studio Local

Watson Studio Local provides a Hadoop registration service that eases the setup of a CDH cluster for Watson Studio Local. Using Hadoop registration is the recommended approach, and it adds the capability of scheduling jobs as YARN applications.

Setting up a CDH cluster for Watson Studio Local entails installing and configuring the following four services:

Service           Purpose
WebHDFS           Browse and preview HDFS data
WebHCAT           Browse and preview Hive tables
Livy for Spark    Submit jobs to Spark on the Hadoop cluster
Livy for Spark2   Submit jobs to Spark2 on the Hadoop cluster
Additionally, for Kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from Watson Studio Local users.
Important: The following tasks must be performed by a Hadoop administrator.

Supported versions

CDH versions 5.12, 5.13, 5.14, 5.15, 5.16, and 6.0.1.

Supported platforms

Hadoop registration is supported on every platform that the supported CDH versions run on.

Set up a CDH cluster with Hadoop registration

Create an edge node

The Hadoop registration service can be installed on a shared edge node, provided the resources described in Edge node hardware requirements are exclusively available for Hadoop registration; that topic also lists the hardware and software requirements. When the edge node is successfully created, it should have the following components (a quick verification sketch follows the list):

  • The HDFS Gateway Role and YARN Gateway Role installed.
  • The Spark Gateway Role installed if the CDH cluster has a Spark service.
  • The Spark2 client installed if the CDH cluster has a Spark2 service.
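To verify that these clients are in place before installing the Hadoop registration service, each one can be invoked from the edge node. This is a minimal sketch, assuming the standard CDH client commands (spark2-submit for the Spark2 client); adjust it to how the gateway roles were actually deployed:

# Verify the HDFS and YARN gateway clients respond
hdfs dfs -ls /
yarn version

# Verify the Spark client (only if the cluster has a Spark service)
spark-submit --version

# Verify the Spark2 client (only if the cluster has a Spark2 service)
spark2-submit --version
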
Install the Hadoop registration service

To install and configure the Hadoop registration service on the edge node, the Hadoop admin must complete the following tasks (a consolidated command sketch follows the list):

  1. Download the Hadoop registration RPM to the edge node.
  2. Run the RPM installer. The RPM is installed in /opt/ibm/dsxhi. If you run the installation as a service user, run sudo chown -R <serviceuser> /opt/ibm/dsxhi.
  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file, using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH as a reference, and edit the values in the conf file. For guidance, see the inline documentation in the template file.
  4. Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script to export the environment variables:
    # Location of the Java binary to use
    export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
    # Shared truststore containing the CA certificates
    export JAVA_CACERTS=/etc/pki/java/cacerts
    # Additional Java options to pass to the service
    export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
  5. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the Hadoop registration service. The script prompts for input on the following options (alternatively, you can specify the options as flags):
    • Accept the license terms (Hadoop registration uses the same license as Watson Studio Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • If the Cloudera Manager URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
    • The master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • If the default password for Java cacerts truststore has been changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
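
The following consolidated sketch shows steps 1 through 5 end to end. It is illustrative only: the RPM file name and the service user are assumptions, and the flags are the ones described above.

# 1-2. Install the RPM on the edge node (RPM file name is hypothetical)
sudo rpm -ivh dsxhi-cdh.rpm
sudo chown -R svcdsxhi /opt/ibm/dsxhi    # only if installing as a service user

# 3. Create the install configuration from the CDH template, then edit it
cp /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH \
   /opt/ibm/dsxhi/conf/dsxhi_install.conf

# 4. (Optional) Create /opt/ibm/dsxhi/conf/dsxhi_env.sh as described above

# 5. Run the installer, passing options as flags instead of answering prompts
cd /opt/ibm/dsxhi/bin
./install.py --password <cluster_admin_password> \
             --dsxhi_gateway_master_password <master_secret>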

The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.
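
For reference, a minimal dsxhi_install.conf might set only the two properties called out above; any remaining values should be filled in per the template's inline documentation. The URL below is an assumption:

# Accept the license terms (same effect as answering the install prompt)
dsxhi_license_acceptance=y
# Cloudera Manager URL; leave unset to skip the proxyuser pre-checks
cluster_manager_url=https://cm-host.example.com:7183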

After a successful installation, the necessary components (Hadoop registration gateway service and Hadoop registration rest service) and optional components (Livy for Spark and Livy for Spark 2) will be started. The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2. The component PIDs are stored in /var/run/dsxhi, /var/run/livy, /var/run/livy2, and /opt/ibm/dsxhi/gateway/logs/.
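
If a component does not appear healthy, the log and PID locations above can be inspected directly. A minimal sketch, assuming one PID file per component with a .pid extension:

# Check whether each recorded PID still corresponds to a running process
for pidfile in /var/run/dsxhi/*.pid /var/run/livy/*.pid /var/run/livy2/*.pid; do
    [ -f "$pidfile" ] && ps -p "$(cat "$pidfile")" > /dev/null && echo "$pidfile: running"
done

# Review recent log output from each component
tail -n 50 /var/log/dsxhi/*.log /var/log/livy/*.log /var/log/livy2/*.log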

To add a CA certificate to the Hadoop registration rest service, go to the /opt/ibm/dsxhi/bin/util directory on the edge node and run the add_cert.sh script with the server address, for example, bash add_cert.sh https://master-1.ibm.com:443.

Manage the Hadoop registration service
Periodically, the Hadoop admin must manage the Hadoop registration service. These tasks include:
Check status of the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./status.py to check the status of the Hadoop registration gateway, Hadoop registration rest server, Livy for Spark, and Livy for Spark2 services.
Start the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./start.py to start the Hadoop registration gateway, Hadoop registration rest server, Livy for Spark, and Livy for Spark2 services.
Stop the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./stop.py to stop the Hadoop registration gateway, Hadoop registration rest server, Livy for Spark, and Livy for Spark2 services.
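For example, to restart all components after a configuration change, the two scripts can be chained:

cd /opt/ibm/dsxhi/bin
./stop.py && ./start.py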
Add certificates for SSL enabled services
If the WebHDFS service is SSL-enabled, the certificates of the NodeManagers and DataNodes need to be added to the Hadoop registration gateway truststore. In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the NodeManagers and DataNodes to add their certificates to the Hadoop registration gateway truststore, as shown in the sketch below.
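A small loop avoids running add_cert.sh by hand for every host; the host:port values below are assumptions and should be replaced with the cluster's actual NodeManager and DataNode addresses:

cd /opt/ibm/dsxhi/bin/util
# Hypothetical endpoints; substitute the real NodeManager and DataNode host:port pairs
for endpoint in dn1.example.com:50475 dn2.example.com:50475 nm1.example.com:8044; do
    ./add_cert.sh "https://$endpoint"
done
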
Troubleshooting tip: If Hadoop SSL is enabled after the Hadoop registration service (DSXHI) has been installed, the DSXHI admin needs to manually update /opt/ibm/dsxhi/gateway/conf/topologies/<cluster_topology.xml> to use the correct HTTPS URL and port, for example:
<service>
  <role>WEBHDFS</role>
  <url>https://NN.HOST:50470</url>
</service>

Reinstalling the Hadoop registration service will also repopulate these URLs correctly.

Manage Watson Studio Local for Hadoop registration

To maintain control over access to a Hadoop registration service, a Hadoop admin needs to maintain a list of known Watson Studio Local clusters that can access the service. A Watson Studio Local cluster is identified by its URL, which should be passed in when adding to or deleting from the known list. A Hadoop admin can add or delete multiple Watson Studio Local clusters at once by passing in a comma-separated list of cluster URLs. Irrespective of the order in which the add and delete arguments are specified, the deletes are applied first, followed by the adds.

Add a Watson Studio Local cluster to the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2,...,urlN". Once a Watson Studio Local cluster is added to the known list, the necessary authentication will be set up and the Watson Studio Local admin can be given a URL to securely connect to the Hadoop registration service.
Delete a Watson Studio Local cluster from the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2,...,urlN".
Hadoop registration URL for secure access from Watson Studio Local

In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to list a table of all known Watson Studio Local clusters and the associated URL that can be used to securely connect from a Watson Studio Local cluster to a Hadoop registration service. The Watson Studio Local admin can then register the Hadoop registration cluster.
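
Putting these commands together, a typical registration session might look like the following; the Watson Studio Local URL is an assumption:

cd /opt/ibm/dsxhi/bin
# Add a Watson Studio Local cluster to the known list (hypothetical URL)
./manage_known_dsx.py --add "https://wsl.example.com"
# List known clusters and the secure URLs to share with the Watson Studio Local admins
./manage_known_dsx.py --list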

Uninstall the Hadoop registration service

To uninstall the Hadoop registration service and remove the files from /opt/ibm/dsxhi, a Hadoop admin can run the ./uninstall.py script in /opt/ibm/dsxhi/bin. The uninstallation logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.

Error with data rendering in CDH 5.16 and CDH 6.0.1

Symptom: When running Jupyter notebooks against CDH 5.16 or CDH 6.0.1 with Spark 2.2, the first dataframe operation that renders cell output from the Spark driver results in a JSON encoding error. Running %%spark -c sql (query) results in DataFrameParseException: Cannot parse object as JSON: '[u'----------------------------------------', u"Exception happened during processing of request from ('127.0.0.1', 36202)", u'----------------------------------------', ...

See the associated Spark JIRA issue for details.

Two possible workarounds:

  • Manually re-run the first cell that renders data (recommended for interactive notebook sessions).
  • Add the following non-intrusive code in a cell after establishing the Spark connection to trigger the first failure (recommended for non-interactive notebook sessions):
    %%spark
    import sys, warnings
    # Report the Python major version seen by the executors
    def python_major_version():
        return sys.version_info[0]
    # Run a trivial Spark job; the catch_warnings block suppresses the warning
    # raised when the first rendering attempt fails
    with warnings.catch_warnings(record=True):
        print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())

Hive support in CDH 6.0.1

The browse and preview feature for Hive tables is not supported in CDH 6.0.1.