Table of contents

Set up Cloudera Distribution for Hadoop (CDH) to work with DSX Local

DSX Local allows users to securely access data residing on a CDH cluster and to submit jobs that use the compute resources of the CDH cluster. DSX Local interacts with a CDH cluster through four services: WebHDFS, WebHCAT, Livy for Spark, and Livy for Spark2. WebHDFS is used to browse and preview HDFS data, WebHCAT is used to browse and preview Hive tables, and Livy for Spark and Livy for Spark2 are used to submit jobs to the Spark or Spark2 engines on the Hadoop cluster.
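For orientation, the REST entry points behind these four services follow well-known Hadoop defaults. The hostnames below are placeholders and the ports are stock defaults; both may differ on your cluster:

```shell
# Hypothetical hosts and default ports; your cluster's values may differ.
NAMENODE=namenode.example.com   # WebHDFS (NameNode, default port 50070)
HCAT=webhcat.example.com        # WebHCAT/Templeton (default port 50111)
LIVY=livy.example.com           # Livy for Spark (default port 8998)

# REST entry points exposed by each service:
WEBHDFS_URL="http://$NAMENODE:50070/webhdfs/v1"
WEBHCAT_URL="http://$HCAT:50111/templeton/v1"
LIVY_URL="http://$LIVY:8998/sessions"

# For example, listing an HDFS directory would be:
#   curl "$WEBHDFS_URL/?op=LISTSTATUS"
echo "$WEBHDFS_URL"
```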

Setting up a CDH cluster for DSX Local entails installing and configuring these four services. Additionally, for kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from DSX Local users. The following tasks must be performed by a Hadoop admin.

Versions supported

CDH versions 5.12, 5.13, and 5.14 are supported.

Platforms supported

DSXHI is supported on all platforms supported by the CDH versions.

Available options for set up

Version 1.2 introduces the DSXHI service, which eases the setup of a CDH cluster for DSX Local. Using DSXHI is the recommended approach, and it adds the ability to schedule jobs as YARN applications. The approach without DSXHI continues to be supported.

Option 1: Set up a CDH cluster with DSXHI

DSXHI diagram

The DSXHI service should be installed on an edge node of the CDH cluster. The gateway component authenticates all incoming requests and forwards them to the Hadoop services. In a kerberized cluster, the keytab of the DSXHI service user and the SPNEGO keytab of the edge node are used to acquire the Kerberos tickets needed to communicate with the Hadoop services. All requests to the Hadoop services are submitted as the DSX Local user.

Edge node hardware requirements

  • 8 GB memory
  • 2 CPU cores
  • 100 GB disk, mounted and available on /var in the local Linux file system
  • 10 Gb network interface card recommended for multi-tenant environments (a 1 Gb network interface card is sufficient if WebHDFS will not be heavily utilized)

Create an edge node

The DSXHI service can be installed on a shared edge node if the resources listed above are exclusively available for DSXHI. When the edge node is successfully created, it should have the following components:

  • The Hadoop Gateway Role installed.
  • The Spark Gateway Role installed if the CDH cluster has a Spark service.
  • The Spark2 client installed if the CDH cluster has a Spark2 service.
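A quick way to confirm these clients are present on the edge node is a guarded PATH lookup. The command names below (`hadoop`, `spark-shell`, `spark2-shell`) are the typical client entry points, though exact names can vary with the distribution:

```shell
# Report whether a client command is on the PATH, without failing the script.
check_client() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_client hadoop        # Hadoop Gateway Role
check_client spark-shell   # Spark client, if the cluster has a Spark service
check_client spark2-shell  # Spark2 client, if the cluster has a Spark2 service
```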

Additional prerequisites

In addition, the following requirements should be met on the edge node:

  • Have Python 2.7 installed.
  • Have curl 7.19.7-53 or later to allow secure communication between DSXHI and DSX Local.
  • Have a service user that can run the DSXHI service. This user should be a valid Linux user with a home directory created in HDFS.
  • The service user should have the necessary Hadoop proxyuser privileges in the HDFS and WebHCAT services to access data and submit asynchronous jobs as DSX Local users.
  • For a kerberized cluster: Have the keytab for the service user. This eliminates the need for every DSX Local user to have a valid keytab.
  • Have an available port for the DSXHI service. The port for DSXHI service should be exposed for access from the DSX Local clusters that need to connect to the CDH cluster.
  • Have an available port for the DSXHI Rest service. This port need not be exposed for external access.
  • Depending on the service that needs to be exposed by DSXHI, have an available port for Livy for Spark and Livy for Spark 2. These ports do not need to be exposed for external access.
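As a sketch of the proxyuser requirement above, the HDFS side is typically granted in core-site.xml using Hadoop's standard impersonation properties. The user name dsxhi and the host below are placeholders for your service user and edge node:

```xml
<!-- core-site.xml: allow the DSXHI service user to impersonate DSX Local
     users. "dsxhi" and the host name are placeholders. -->
<property>
  <name>hadoop.proxyuser.dsxhi.hosts</name>
  <value>dsxhi-edge.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.groups</name>
  <value>*</value>
</property>
```

An equivalent grant is needed for the WebHCAT service; consult the CDH documentation for the corresponding property names.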

Install DSXHI

To install and configure DSXHI service on the edge node, the Hadoop admin must complete the following tasks:

  1. Download the DSXHI RPM file (dsxhi_<platform>.rpm) to the edge node, for example, dsxhi_x86_64.rpm.
  2. Run the RPM installer. The rpm is installed in /opt/ibm/dsxhi.
  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH as a reference. Edit the values in the conf file. For guidance, see the inline documentation in the dsxhi_install.conf.template.CDH files.
  4. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the DSXHI service. The script prompts for inputs on the following options (alternatively, you can specify the options as flags):
    • Accept the license terms (DSXHI uses the same license as DSX Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • If the Cloudera Manager URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
    • The master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • The password to be used for creating the self-signed certificate for the DSXHI gateway service. The value can also be passed through the --dsxhi_self_signed_cert_pass flag.
    • If the default password for Java cacerts truststore has been changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
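Pulling the flags above together, a non-interactive invocation might look like the following sketch. The password values are placeholders, not real credentials; review the composed command, then run it in /opt/ibm/dsxhi/bin on the edge node:

```shell
# Compose the documented install.py flags; all values shown are
# placeholders. Echo the command for review before running it.
install_cmd="./install.py \
  --password cm-admin-password \
  --dsxhi_gateway_master_password gateway-master-secret \
  --dsxhi_self_signed_cert_pass selfsigned-cert-password"
echo "$install_cmd"
```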

The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.

After a successful installation, the necessary components (DSXHI gateway service and DSXHI rest service) and optional components (Livy for Spark and Livy for Spark 2) will be started. The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2. The component PIDs are stored in /var/run/dsxhi, /var/run/livy, /var/run/livy2, and /opt/ibm/dsxhi/gateway/logs/.

To add cacert to the DSXHI rest service, go to the /opt/ibm/dsxhi/util directory on the edge node and run the add_cert.sh script with the server address, for example, bash add_cert.sh https://master-1.ibm.com:443.

Manage the DSXHI service

Periodically, the Hadoop admin must manage the DSXHI service. These tasks include:

Check status of the DSXHI service
In /opt/ibm/dsxhi/bin, run ./status.py to check the status of the DSXHI gateway, DSXHI rest server, Livy for Spark, and Livy for Spark 2 services.
Start the DSXHI service
In /opt/ibm/dsxhi/bin, run ./start.py to start the DSXHI gateway, DSXHI rest server, Livy for Spark, and Livy for Spark 2 services.
Stop the DSXHI service
In /opt/ibm/dsxhi/bin, run ./stop.py to stop the DSXHI gateway, DSXHI rest server, Livy for Spark, and Livy for Spark 2 services.

Manage DSX Local for DSXHI

To maintain control over access to a DSXHI service, a Hadoop admin needs to maintain a list of known DSX Local clusters that can access the DSXHI service. A DSX Local cluster is identified by its URL, which should be passed in when adding to or deleting from the known list. A Hadoop admin can add (or delete) multiple DSXL clusters in one invocation by passing in a comma-separated list of DSXL cluster URLs. Irrespective of the order in which the add and delete arguments are specified, the deletes are applied first and then the adds.

Add a DSX Local cluster to the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a DSX Local cluster is added to the known list, the necessary authentication will be set up and the DSX admin can be given a URL to securely connect to the DSXHI service.
Delete a DSX Local cluster from the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
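The delete-before-add semantics can be sketched as follows. This is an illustrative shell function, not the real manage_known_dsx.py logic; the URLs are placeholders:

```shell
# Apply a comma-separated delete list first, then a comma-separated add
# list, to a comma-separated known list -- mirroring the documented order.
apply_known_list() {
  known="$1"; adds="$2"; deletes="$3"
  result=""
  # Drop every URL named in the delete list.
  for url in $(echo "$known" | tr ',' ' '); do
    case ",$deletes," in
      *",$url,"*) ;;  # deleted: skip
      *) result="${result:+$result,}$url" ;;
    esac
  done
  # Then append the additions.
  for url in $(echo "$adds" | tr ',' ' '); do
    result="${result:+$result,}$url"
  done
  echo "$result"
}
```

Because deletes run first, listing the same URL in both arguments leaves it in the known list.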

DSXHI URL for secure access from DSX Local

In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to list a table of all known DSX Local clusters and the associated URL that can be used to securely connect from a DSX Local cluster to a DSXHI service. The DSX admin can then register the DSXHI cluster.

Uninstall the DSXHI service

To uninstall the DSXHI service and remove the files from /opt/ibm/dsxhi, a Hadoop admin can run the ./uninstall.py script in /opt/ibm/dsxhi/bin. The uninstallation logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.

Option 2: Set up a CDH cluster without DSXHI

A CDH cluster can be set up for the DSX Local cluster without using the DSXHI service. If your CDH cluster does not use Kerberos security, then ensure DSX Local can access the host and port of the services. No additional configuration is needed. If the CDH cluster uses Kerberos security, follow the steps outlined below.

CDH requirements (see CDH documentation for guidance):

  • Knox v0.13 and Livy v0.3.0 installed.
  • Knox service is installed with SSL enabled.
  • The Livy service for Spark must be set up to be accessible through Knox. You must add the service.xml file to <Knox installation directory>/data/services/livy/0.3.0/. See Adding a service to Apache Knox for details.
  • The Livy service for Spark should have the rewrite rule definition to support impersonation. You can verify this by checking that the rewrite.xml file exists in <Knox installation directory>/data/services/livy/0.3.0/. See Adding Livy Server as service to Apache Knox for details.
  • In the <Livy installation directory>/conf/livy.conf file, set the following property: livy.superusers=knox.

To configure the CDH cluster, you must create a new Knox topology named dsx that is based on JWT authentication and has the service entries for Livy for Spark, WebHDFS, and WebHCAT. Complete the following steps:

  1. Go to https://9.87.654.320/auth/jwtcert (where https://9.87.654.320 represents the DSX Local URL) and save the public SSL certificate jwt.cert. Alternatively, run a curl command to download the SSL certificate from DSX Local:

    curl -k https://9.87.654.320/auth/jwtcert
    
  2. In the <Knox installation directory>/conf/topologies directory of your Knox server, create a new topology for DSX named dsx.xml, and paste the key from the SSL certificate (between BEGIN CERTIFICATE and END CERTIFICATE) into the <value> tag. Also, ensure you have service entries for Livy for Spark, WebHDFS, and WebHCAT. Example:

    <topology>
      <gateway>
        <provider>
          <role>federation</role>
          <name>JWTProvider</name>
          <enabled>true</enabled>
          <param>
            <name>knox.token.verification.pem</name>
            <value>MIIDb...Zpuw</value>
          </param>
        </provider>
        <provider>
          <role>identity-assertion</role>
          <name>Default</name>
          <enabled>true</enabled>
        </provider>
        <provider>
          <role>authorization</role>
          <name>AclsAuthz</name>
          <enabled>true</enabled>
        </provider>
      </gateway>
      <service>
        <role>LIVYSERVER</role>
        <url>http://9.87.654.323:8998</url>
      </service>
      <service>
        <role>WEBHDFS</role>
        <url>http://9.87.654.321:50070/webhdfs</url>
      </service>
      <service>
        <role>WEBHCAT</role>
        <url>http://9.87.543.324:50111/templeton</url>
      </service>
    </topology>
    
  3. Touch the dsx.xml file to update the timestamp on it.

  4. Restart the Knox server to detect the new topology. The services will then be reachable through the Knox gateway under the dsx topology (typically https://<Knox host>:<Knox port>/gateway/dsx/<service path>).

  5. Configure DSX Local to work with the CDH cluster. See the setup documentation for details.
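Once Knox is restarted, requests from DSX Local flow through the dsx topology. The following sketch composes the gateway URL for WebHDFS; the host, port, and token are placeholders, and the /gateway/<topology> path is Knox's usual convention:

```shell
# Placeholder Knox host and default gateway port; adjust for your cluster.
KNOX_HOST=9.87.654.322
KNOX_PORT=8443
TOPOLOGY=dsx
WEBHDFS_URL="https://$KNOX_HOST:$KNOX_PORT/gateway/$TOPOLOGY/webhdfs/v1"
echo "$WEBHDFS_URL"

# A DSX Local request would carry its JWT as a bearer token, e.g.:
#   curl -k -H "Authorization: Bearer <jwt>" "$WEBHDFS_URL/?op=LISTSTATUS"
```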