Set up Hortonworks Data Platform (HDP) to work with Watson Studio Local

Watson Studio Local allows users to securely access data residing on an HDP cluster and to submit jobs that use the cluster's compute resources. Watson Studio Local interacts with an HDP cluster through four services: WebHDFS, WebHCAT, Livy for Spark, and Livy for Spark2. WebHDFS is used to browse and preview HDFS data, WebHCAT is used to browse and preview Hive tables, and Livy for Spark and Livy for Spark2 are used to submit jobs to the Spark or Spark2 engines on the Hadoop cluster.

Setting up an HDP cluster for Watson Studio Local entails installing and configuring these four services. Additionally, for kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from Watson Studio Local users. The following tasks must be performed by a Hadoop admin.

Versions supported

  • HDP Version 2.5.6 and later fixpacks
  • HDP Version 2.6.2 and later fixpacks
  • HDP Version 3.0 (Data Access and Spark Analytics for HDFS data)

Platforms supported

Hadoop registration is supported on all platforms supported by the HDP versions.

Available options for set up

Version 1.2 introduces a Hadoop registration service that eases the setup of an HDP cluster for Watson Studio Local. Using Hadoop registration is the recommended approach, and it adds the ability to schedule jobs as YARN applications. The approach without Hadoop registration continues to be supported.

Option 1: Set up an HDP cluster with Hadoop registration

Hadoop integration diagram

The Hadoop registration service should be installed on an edge node of the HDP cluster. The gateway component authenticates all incoming requests and forwards them to the Hadoop services. In a kerberized cluster, the keytab of the Hadoop registration service user and the spnego keytab for the edge node are used to acquire the Kerberos tickets needed to communicate with the Hadoop services. All requests to the Hadoop services are submitted as the Watson Studio Local user.

Edge node hardware requirements
  • 8 GB memory
  • 2 CPU cores
  • 100 GB disk, mounted and available on /var in the local Linux file system
  • 10 Gb network interface card recommended for a multi-tenant environment (a 1 Gb network interface card is sufficient if WebHDFS will not be heavily utilized)
Create an edge node
The Hadoop registration service can be installed on a shared edge node if the resources listed above are exclusively available for Hadoop registration. To create a new edge node, follow the steps in the HDP documentation. When the edge node is successfully created, it should have the following components:
  • The Hadoop client installed.
  • The Spark client installed if the HDP cluster has a Spark service.
  • The Spark2 client installed if the HDP cluster has a Spark2 service.
  • For a kerberized cluster, the spnego keytab copied to /etc/security/keytabs/spnego.service.keytab.
Additional prerequisites
In addition, the following requirements should be met on the edge node:
  • Have Python 2.7 installed.
  • Have curl 7.19.7-53 or later to allow secure communication between the Hadoop registration service and Watson Studio Local.
  • Have a service user that can run the Hadoop registration service. This user should be a valid Linux user with a home directory created in HDFS.
  • The service user should have the necessary Hadoop Proxyuser privileges in HDFS, WebHCAT and Livy services to access data and submit asynchronous jobs as Watson Studio Local users.
  • For a kerberized cluster: Have the keytab for the service user. This eliminates the need for every Watson Studio Local user to have a valid keytab.
  • Have an available port for the Hadoop registration service. The port for the Hadoop registration service should be exposed for access from the Watson Studio Local clusters that need to connect to the HDP cluster.
  • Have an available port for the Hadoop registration REST service. This port need not be exposed for external access.
  • Depending on the service that needs to be exposed by Hadoop registration, have an available port for Livy for Spark and Livy for Spark 2. These ports do not need to be exposed for external access.
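The proxyuser requirement above is typically satisfied through custom core-site properties set in Ambari. A minimal sketch follows, assuming the service user is named dsxhi; the user name and the wildcard values are assumptions, so restrict them to your edge node host and user groups as your security policy requires:

```xml
<!-- Illustrative core-site.xml entries set through Ambari; the service
     user name "dsxhi" and the wildcard values are assumptions. -->
<property>
  <name>hadoop.proxyuser.dsxhi.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.groups</name>
  <value>*</value>
</property>
```

Equivalent impersonation settings are also needed in the WebHCAT and Livy services, as noted in the prerequisites above.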
Install the Hadoop registration service
To install and configure the Hadoop registration service on the edge node, the Hadoop admin must complete the following tasks:
  1. Download the Hadoop registration RPM file (dsxhi_<platform>.rpm) to the edge node, for example, dsxhi_x86_64.rpm.
  2. Run the RPM installer. The rpm is installed in /opt/ibm/dsxhi.
  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP as a reference, and edit the values in the conf file. For guidance, see the inline documentation in the dsxhi_install.conf.template.HDP file. When installing on the Power platform, set package_installer_tool=yum and packages=lapack so that the installer installs the packages needed for virtual environments.
  4. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the Hadoop registration service. The script prompts for inputs on the following options (alternatively, you can specify the options as flags):
    • Accept the license terms (Hadoop registration uses the same license as Watson Studio Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • If the Ambari URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
    • The master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • If the default password for Java cacerts truststore has been changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
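As a sketch only, a dsxhi_install.conf might contain values like the following; every hostname and value shown is an assumption, and the inline documentation in the template remains the authoritative reference:

```ini
# Illustrative values only -- consult dsxhi_install.conf.template.HDP for
# the authoritative list of properties and their meanings.
dsxhi_license_acceptance=y
cluster_manager_url=https://ambari.example.com:8443
# Needed only when installing on the Power platform:
package_installer_tool=yum
packages=lapack
```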

The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.

After a successful installation, the necessary components (Hadoop registration gateway service and Hadoop registration REST service) and optional components (Livy for Spark and Livy for Spark 2) will be started. The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2. The component PIDs are stored in /var/run/dsxhi, /var/run/livy, /var/run/livy2, and /opt/ibm/dsxhi/gateway/logs/.

To add a cacert to the Hadoop registration REST service, go to the /opt/ibm/dsxhi/bin/util directory on the edge node and run the add_cert.sh script with the server address, for example, bash add_cert.sh https://master-1.ibm.com:443.

Manage the Hadoop registration service
Periodically, the Hadoop admin must manage the Hadoop registration service. These tasks include:
Check status of the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./status.py to check the status of the Hadoop registration gateway, Hadoop registration REST server, Livy for Spark, and Livy for Spark 2 services.
Start the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./start.py to start the Hadoop registration gateway, Hadoop registration REST server, Livy for Spark, and Livy for Spark 2 services.
Stop the Hadoop registration service
In /opt/ibm/dsxhi/bin, run ./stop.py to stop the Hadoop registration gateway, Hadoop registration REST server, Livy for Spark, and Livy for Spark 2 services.
Add certificates for SSL enabled services
If the WebHDFS service is SSL enabled, the certificates of the NodeManagers and DataNodes should be added to the Hadoop registration gateway trust store. In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the NodeManagers and DataNodes to add their certificates to the Hadoop registration gateway trust store.
Alluxio requirement: To connect to Alluxio using remote Spark with Livy, go to Ambari > HDFS > Config > Custom Core-site > Add property in the Ambari web client and add the fs.alluxio.impl configuration for the remote Spark. See Running Spark on Alluxio and Connect to a remote Spark in an HDP cluster using Alluxio for details.
Manage Watson Studio Local for Hadoop registration
To maintain control over access to a Hadoop registration service, a Hadoop admin needs to maintain a list of known Watson Studio Local clusters that can access the Hadoop registration service. A Watson Studio Local cluster is known by its URL, which should be passed in when adding to or deleting from the known list. A Hadoop admin can add (or delete) multiple Watson Studio Local clusters by passing in a comma-separated list of cluster URLs. Irrespective of the order in which the add and delete arguments are specified, the deletes are applied first, then the adds.
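As an illustration only (this is not manage_known_dsx.py itself, and the URLs are placeholders), the delete-before-add ordering means a URL that appears in both lists in one invocation ends up removed before the adds are applied:

```shell
# Simulate the known list as a comma-separated string, the same shape
# manage_known_dsx.py accepts on the command line.
known="https://dsx-a.example.com,https://dsx-b.example.com"
to_delete="https://dsx-a.example.com"
to_add="https://dsx-c.example.com"

# Deletes are applied first...
known=$(echo "$known" | tr ',' '\n' | grep -vxF "$to_delete" | paste -sd, -)
# ...then the adds.
known="$known,$to_add"
echo "$known"   # https://dsx-b.example.com,https://dsx-c.example.com
```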
Add a Watson Studio Local cluster to the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a Watson Studio Local cluster is added to the known list, the necessary authentication will be set up and the Watson Studio Local admin can be given a URL to securely connect to the Hadoop registration service.
Delete a Watson Studio Local cluster from the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
Hadoop registration URL for secure access from Watson Studio Local
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to list a table of all known Watson Studio Local clusters and the associated URLs that can be used to securely connect from a Watson Studio Local cluster to the Hadoop registration service. The Watson Studio Local admin can then register the Hadoop registration service with Watson Studio Local.
Uninstall the Hadoop registration service
To uninstall the Hadoop registration service and remove the files from /opt/ibm/dsxhi, a Hadoop admin can run the ./uninstall.py script in /opt/ibm/dsxhi/bin. The uninstallation logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.

Option 2: Set up an HDP cluster without Hadoop registration

An HDP cluster can be set up for the Watson Studio Local cluster without using the Hadoop registration service. If your HDP cluster does not use Kerberos security, ensure that Watson Studio Local can access the host and port of the services; no additional configuration is needed. If the HDP cluster uses Kerberos security, follow the steps outlined below.

HDP requirements (see HDP documentation for guidance):

  • Knox service is installed with SSL enabled.
  • The Livy service for Spark or Livy service for Spark 2 must be set up to be accessible through Knox. You can verify this by checking that the service.xml file exists in either /usr/hdp/current/knox-server/data/services/livy/0.1.0/ or /usr/hdp/current/knox-server/data/services/livy2/0.1.0/. See Adding Livy Server as service to Apache Knox for details (adjust the steps accordingly for Livy service for Spark 2).
  • The Livy service for Spark or Livy service for Spark 2 should have the rewrite rule definition to support impersonation. You can verify this by checking that the rewrite.xml file exists in either /usr/hdp/current/knox-server/data/services/livy/0.1.0/ or /usr/hdp/current/knox-server/data/services/livy2/0.1.0/. See Adding Livy Server as service to Apache Knox for details (adjust the steps accordingly for Livy service for Spark 2).
  • In Spark > configs in the Ambari web client, you must edit the Livy conf file for Spark to add the following property: livy.superusers=knox and restart the Spark service.
  • In Spark2 > configs in the Ambari web client, you must edit the Livy conf file for Spark2 to add the following property: livy.superusers=knox and restart the Spark2 service.

To configure the HDP cluster, you must create a new Knox topology named dsx that is based on JWT authentication and has the service entries for Livy for Spark, Livy for Spark 2, WebHDFS, and WebHCAT. Complete the following steps:

  1. Go to https://9.87.654.320/auth/jwtcert (where https://9.87.654.320 represents the Watson Studio Local URL) and save the public SSL certificate jwt.cert. Alternatively, run a curl command to download the SSL certificate from Watson Studio Local:
    curl -k https://9.87.654.320/auth/jwtcert
  2. In the /usr/hdp/current/knox-server/conf/topologies directory of your Knox server, create a new topology for Watson Studio Local named dsx.xml, and paste the key from the SSL certificate (between BEGIN CERTIFICATE and END CERTIFICATE) into the <value> tag. Also, ensure you have service entries for Livy for Spark, Livy for Spark 2, WebHDFS, and WebHCAT. Example:
    <topology>
    <gateway>
     <provider>
        <role>federation</role>
        <name>JWTProvider</name>
        <enabled>true</enabled>
        <param>
        <name>knox.token.verification.pem</name>
        <value>MIIDb...Zpuw</value>
        </param>
      </provider>
     <provider>
        <role>identity-assertion</role>
        <name>Default</name>
        <enabled>true</enabled>
     </provider>
     <provider>
        <role>authorization</role>
        <name>AclsAuthz</name>
        <enabled>true</enabled>
     </provider>
    </gateway>
    <service>
     <role>LIVYSERVER</role>
     <url>http://9.87.654.323:8998</url>
    </service>                  
    <service>
     <role>LIVYSERVER2</role>
     <url>http://9.87.654.322:8999</url>
    </service>
    <service>
     <role>WEBHDFS</role>
     <url>http://9.87.654.321:50070/webhdfs</url>
    </service>
    <service>
     <role>WEBHCAT</role>
     <url>http://9.87.543.324:50111/templeton</url>
    </service>
    </topology>
  3. Touch the dsx.xml file to update the timestamp on it.
  4. Restart the Knox server to detect the new topology. The URLs for the four services will be:
    https://knoxhost:8443/gateway/dsx/webhdfs/v1
    https://knoxhost:8443/gateway/dsx/templeton/v1
    https://knoxhost:8443/gateway/dsx/livy/v1
    https://knoxhost:8443/gateway/dsx/livy2/v1
  5. Configure Watson Studio Local to work with the HDP cluster. For details, see the Watson Studio Local setup documentation.
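The value pasted into the <value> tag in step 2 is the base64 body of jwt.cert on a single line, with the BEGIN/END markers removed. A minimal sketch of producing it (the certificate content below is a stand-in, not a real certificate):

```shell
# Stand-in for the certificate downloaded in step 1; in practice:
#   curl -k https://<watson-studio-local-host>/auth/jwtcert -o jwt.cert
cat > jwt.cert <<'EOF'
-----BEGIN CERTIFICATE-----
MIIDb
Zpuw
-----END CERTIFICATE-----
EOF

# Drop the BEGIN/END marker lines and join the remaining base64 lines
# into the single string expected inside the <value> tag of dsx.xml.
pem_value=$(sed '/-----BEGIN CERTIFICATE-----/d;/-----END CERTIFICATE-----/d' jwt.cert | tr -d '\n')
echo "$pem_value"   # MIIDbZpuw
```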