Set up Hortonworks Data Platform (HDP) to work with Watson Studio Local

Watson Studio Local provides a Hadoop registration service that eases the setup of an HDP cluster for Watson Studio Local. Using the Hadoop registration service is the recommended approach, and it adds the ability to schedule jobs as YARN applications.

Setting up an HDP cluster for Watson Studio Local entails installing and configuring the following four services:
  • WebHDFS: browse and preview HDFS data.
  • WebHCAT: browse and preview Hive tables.
  • Livy for Spark: submit jobs to Spark on the Hadoop cluster.
  • Livy for Spark2: submit jobs to Spark2 on the Hadoop cluster.
Additionally, for kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from Watson Studio Local users.
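The two Livy services expose a REST API through which jobs are submitted to Spark. As a minimal sketch only, assuming Livy for Spark2 is running on the edge node (the scheme, host, port, and HDFS path below are placeholders, and on a kerberized cluster requests are authenticated through the gateway rather than sent directly to Livy), a batch submission looks roughly like this:
    # Sketch of a Livy batch submission (placeholder scheme/host/port/paths).
    curl -X POST -H "Content-Type: application/json" \
      -d '{"file": "hdfs:///user/<user>/app.jar", "className": "com.example.App"}' \
      http://<edge-node>:8999/batches
    # Check the state of the submitted batch (batch id 0 assumed).
    curl http://<edge-node>:8999/batches/0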
Important: The following tasks must be performed by a Hadoop administrator.

Supported versions

Platforms supported

Hadoop registration is supported on all platforms supported by the HDP versions.

Set up an HDP cluster with Hadoop registration

Create an edge node
The Hadoop registration service can be installed on a shared edge node, provided that the required resources are exclusively available for Hadoop registration. See Edge node hardware requirements for the hardware and software requirements. When the edge node is successfully created, it should have the following components:
  • The HDFS Gateway Role and YARN Gateway Role installed.
  • The Spark client installed if the HDP cluster has a Spark service.
  • The Spark2 client installed if the HDP cluster has a Spark2 service.
  • For a kerberized cluster, the spnego keytab copied to /etc/security/keytabs/spnego.service.keytab.
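Before installing the Hadoop registration service, it can be useful to spot-check that these components are in place. The commands below are an illustrative sketch only; the client command names and the Spark2 client path are assumptions based on a typical HDP layout:
    # Confirm the HDFS and YARN gateway roles and the Spark client are usable.
    which hdfs yarn spark-submit
    # Confirm the Spark2 client if the cluster has a Spark2 service (path assumed).
    ls /usr/hdp/current/spark2-client
    # For a kerberized cluster, confirm the spnego keytab is in place.
    ls -l /etc/security/keytabs/spnego.service.keytab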
Install the Hadoop registration service
To install and configure the Hadoop registration service on the edge node, the Hadoop admin must complete the following tasks:
  1. Download the Hadoop registration RPM to the edge node.
  2. Run the RPM installer. The RPM installs to /opt/ibm/dsxhi. If you run the installation as a service user, run sudo chown -R <serviceuser> /opt/ibm/dsxhi.
  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file, using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP as a reference, and edit the values in the conf file (see the example sketch after this list). For guidance, see the inline documentation in the dsxhi_install.conf.template.HDP file. When installing on the Power platform, set package_installer_tool=yum and packages=lapack so that the installer can install the packages needed for virtual environments.
  4. Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script to export the environment variables:
    export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
    export JAVA_CACERTS=/etc/pki/java/cacerts
    export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
  5. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the Hadoop registration service. The script prompts for input for the following options (alternatively, you can specify the options as flags):
    • Accept the license terms (Hadoop registration uses the same license as Watson Studio Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • If the Ambari URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
    • The master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • If the default password for Java cacerts truststore has been changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
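Step 3 refers to editing dsxhi_install.conf. The fragment below is only a sketch of what such a file might contain, limited to properties mentioned on this page and with illustrative values; take the full property list and accepted values from dsxhi_install.conf.template.HDP:
    # Sketch of /opt/ibm/dsxhi/conf/dsxhi_install.conf (illustrative values only).
    # Ambari URL; if omitted, the proxyuser pre-checks are skipped.
    cluster_manager_url=https://ambari.example.com:8443
    # Accept the license here instead of interactively at install time.
    dsxhi_license_acceptance=y
    # Required on the Power platform so the installer can set up virtual environments.
    package_installer_tool=yum
    packages=lapack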

The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.
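As an illustration of a non-interactive run, the options described in step 5 can be supplied as flags; the values below are placeholders:
    cd /opt/ibm/dsxhi/bin
    # Pass the cluster administrator password, the gateway master secret, and
    # (only if it was changed) the Java cacerts password as flags instead of prompts.
    ./install.py --password <ambari_admin_password> \
                 --dsxhi_gateway_master_password <gateway_master_secret> \
                 --dsxhi_java_cacerts_password <cacerts_password>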

After a successful installation, the necessary components (Hadoop registration gateway service and Hadoop registration rest service) and optional components (Livy for Spark and Livy for Spark 2) will be started. The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2. The component PIDs are stored in /var/run/dsxhi, /var/run/livy, /var/run/livy2, and /opt/ibm/dsxhi/gateway/logs/.
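To confirm that the components came up, the log and PID locations listed above can be spot-checked; the .pid file naming below is an assumption:
    # List the component logs written during startup.
    ls /var/log/dsxhi /var/log/livy /var/log/livy2
    # Confirm that each recorded PID corresponds to a running process.
    for f in /var/run/dsxhi/*.pid /var/run/livy/*.pid /var/run/livy2/*.pid; do
      [ -f "$f" ] && ps -p "$(cat "$f")"
    done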

To add a cacert to the Hadoop registration rest service, go to the /opt/ibm/dsxhi/bin/util directory on the edge node and run the add_cert.sh script with the server address, for example, bash add_cert.sh https://master-1.ibm.com:443.

Manage the Hadoop registration service
Periodically, the Hadoop admin must manage the Hadoop registration service. These tasks include:
Check status of the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./status.py to check the status of the Hadoop registration gateway, Hadoop registration rest server, Livy for Spark, and Livy for Spark2 services.
Start the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./start.py to start the Hadoop registration gateway, Hadoop registration rest server, Livy for Spark, and Livy for Spark2 services.
Stop the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./stop.py to stop the Hadoop registration gateway, Hadoop registration rest server, Livy for Spark, and Livy for Spark2 services.
Add certificates for SSL enabled services
If the WebHDFS service is SSL enabled, the certificates of the nodemanagers and datanodes should be added to the Hadoop registration gateway trust store. In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the nodemanagers and datanodes to add its certificate to the Hadoop registration gateway trust store.
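Because each nodemanager and datanode must be added individually, a loop over the hosts can help; the host names and ports below are placeholders:
    cd /opt/ibm/dsxhi/bin/util
    # Add each datanode and nodemanager certificate to the gateway trust store
    # (placeholder hosts and ports).
    for host in datanode1.example.com:50475 datanode2.example.com:50475 nodemanager1.example.com:8044; do
      ./add_cert.sh "https://$host"
    done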
Troubleshooting tip: If Hadoop SSL is enabled after DSXHI has been installed, the DSXHI admin must manually update /opt/ibm/dsxhi/gateway/conf/topologies/<cluster_topology.xml> to use the correct HTTPS URL and port, for example:
<service>
  <role>WEBHDFS</role>
  <url>https://NN.HOST:50470</url>
</service>

Reinstalling DSXHI also repopulates these values correctly.

Alluxio requirement: To connect to Alluxio using remote Spark with Livy, go to Ambari > HDFS > Config > Custom Core-site > Add property in the Ambari web client and add the fs.alluxio.impl configuration for the remote Spark. See Running Spark on Alluxio and Connect to a remote Spark in an HDP cluster using Alluxio for details.
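The custom core-site property typically maps fs.alluxio.impl to Alluxio's Hadoop-compatible file system class; the value below follows the Alluxio documentation and should be confirmed against the Alluxio release in use:
    # Custom core-site property added through Ambari (value per Alluxio documentation).
    fs.alluxio.impl=alluxio.hadoop.FileSystem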
Manage Watson Studio Local for Hadoop registration
To maintain control over access to a Hadoop registration service, a Hadoop admin needs to maintain a list of known Watson Studio Local clusters that can access the Hadoop registration service. A Watson Studio Local cluster is known by its URL, which should be passed in when adding to or deleting from the known list. A Hadoop admin can add (or delete) multiple Watson Studio Local clusters by passing in a comma-separated list of cluster URLs. Irrespective of the order in which the add and delete arguments are specified, the deletes are applied first and then the adds.
Add a Watson Studio Local cluster to the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a Watson Studio Local cluster is added to the known list, the necessary authentication will be set up and the Watson Studio Local admin can be given a URL to securely connect to the Hadoop registration service.
Delete a Watson Studio Local cluster from the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
Hadoop registration URL for secure access from Watson Studio Local
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to list a table of all known Watson Studio Local clusters and the associated URL that can be used to securely connect from a Watson Studio Local cluster to a Hadoop registration service. The Watson Studio Local admin can then register the Hadoop registration cluster.
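As an illustration of the ordering rule above (deletes are applied before adds), both flags can be combined in a single invocation; the URLs below are placeholders:
    cd /opt/ibm/dsxhi/bin
    # Remove a retired Watson Studio Local cluster and register a new one in one call;
    # the delete is applied before the add (placeholder URLs).
    ./manage_known_dsx.py --delete "https://old-wsl.example.com" --add "https://new-wsl.example.com"
    # Show the known clusters and their Hadoop registration URLs.
    ./manage_known_dsx.py --list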
Uninstall the Hadoop registration service
To uninstall the Hadoop registration service and remove the files from /opt/ibm/dsxhi, a Hadoop admin can run the ./uninstall.py script in /opt/ibm/dsxhi/bin. The uninstallation logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.

Work with Hive in HDP Version 3.0.1 and 3.1

The dsx-samples project includes a sample notebook that explains how an analytics application can access Hive tables through the Spark catalog and the Hive catalog. Note that the Browse and Preview feature for Hive tables is not supported.