Hadoop integration

The DSX Hadoop Integration Service (DSXHI) can be installed on a Hadoop edge node to allow DSX Local Version 1.2 or later clusters to securely access data that resides on the Hadoop cluster, submit interactive Spark jobs, build models, and schedule jobs that run as YARN applications on the Hadoop cluster.

Requirement: Your Hadoop admin must install the DSXHI service on either an HDP or a CDH cluster, add the DSX Local cluster to DSXHI as a managed cluster, and provide the DSXHI service user ID and the URL, specific to the DSX Local cluster, that is used to connect securely to the DSXHI service.

A DSX administrator can then perform the following tasks in the Admin Console:

  1. Register a DSXHI cluster
  2. View details about the DSXHI cluster
  3. Push runtime images to the DSXHI cluster

Register a DSXHI cluster

In the Admin Console, click the menu icon and then click Hadoop Integration to register your DSXHI clusters and create images for virtual runtime environments on them. When a DSXHI service is registered in DSX Local, HDFS and Hive data sources are created automatically, depending on the configuration of the DSXHI system. DSX users can then list the available DSXHI Livy endpoints for use with Jupyter, RStudio, and Zeppelin, work with the data sources and Livy, and select DSXHI images as workers to which they can remotely submit Python jobs.

To register a new DSXHI cluster installation, click Add Registration, then name the registration and provide the authentication information.
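Registration requires the DSXHI service URL and service user ID that your Hadoop admin provided. As a quick sanity check before registering, you can verify from a DSX Local node that the URL is reachable. The following is a minimal sketch, not part of the product; the URL and certificate path are hypothetical placeholders:

    # Minimal reachability check for the DSXHI service URL supplied by the
    # Hadoop admin. The URL below is a hypothetical placeholder. DSXHI is
    # typically served over HTTPS, often with a self-signed certificate, so
    # point `verify` at the appropriate CA bundle in practice.
    import requests

    DSXHI_URL = "https://dsxhi-edge.example.com:8443"  # hypothetical
    resp = requests.get(DSXHI_URL, verify="/path/to/ca-bundle.crt", timeout=30)
    print(resp.status_code)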

View details about the DSXHI cluster

On the Details page of each registration, you can view its endpoints and runtimes.

Depending on the services that were exposed when the DSXHI service was installed and configured on the Hadoop edge node, the Details page lists the WebHDFS, WebHCAT, Livy for Spark, and Livy for Spark 2 URLs that DSXHI exposes.

  • If the WebHDFS or WebHCAT service (or both) is exposed in conjunction with the Livy service, DSX users can work with the data sources associated with those services without having to create them explicitly.
  • If the Livy for Spark or Livy for Spark 2 service (or both) is exposed, DSX users can list these endpoints through dsx_core_utils and dsxCoreUtilsR, and use them as defined Livy endpoints in Jupyter, RStudio, and Zeppelin notebooks (a sketch of calling an endpoint directly follows this list). Python syntax:
    %python
    import dsx_core_utils
    dsx_core_utils.list_dsxhi_livy_endpoints()
    
    R syntax:
    library('dsxCoreUtilsR')
    dsxCoreUtilsR::listDSXHILivyEndpoints()
    
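The URLs that these utilities return are standard Apache Livy REST endpoints, so they can also be exercised directly, for example from a script. The following is a minimal sketch that starts an interactive PySpark session and submits one statement; the endpoint URL is a hypothetical placeholder, and the authentication headers depend on how your DSXHI service is configured:

    # Drive a DSXHI Livy endpoint with the standard Apache Livy REST API.
    # The URL is a hypothetical placeholder of the kind returned by
    # dsx_core_utils.list_dsxhi_livy_endpoints(); add whatever auth headers
    # your DSXHI installation requires.
    import time
    import requests

    LIVY_URL = "https://dsxhi-edge.example.com:8443/gateway/dsx/livy2/v1"  # hypothetical
    HEADERS = {"Content-Type": "application/json"}

    # Start an interactive PySpark session; it runs as a YARN application.
    session = requests.post(LIVY_URL + "/sessions",
                            json={"kind": "pyspark"}, headers=HEADERS).json()["id"]

    # Wait until the session is idle before submitting code.
    while requests.get(LIVY_URL + "/sessions/%d" % session,
                       headers=HEADERS).json()["state"] != "idle":
        time.sleep(5)

    # Submit a statement; the response includes a statement ID that can be
    # polled for the output once the statement completes.
    stmt = requests.post(LIVY_URL + "/sessions/%d/statements" % session,
                         json={"code": "sc.parallelize(range(100)).sum()"},
                         headers=HEADERS).json()
    print(stmt)

    # Delete the session when finished.
    requests.delete(LIVY_URL + "/sessions/%d" % session, headers=HEADERS)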

If a registered endpoint changes, you can refresh the registration.

Push runtime images to the DSXHI cluster

Before users can select workers on the DSXHI system and submit jobs to them, the DSX administrator must push the corresponding runtime images to the DSXHI system.

To push a runtime image to the DSXHI cluster from its Details page, click Push next to the image. Note that pushing an image can take a long time. If you modified a runtime image locally, you can update it on the remote cluster by clicking Replace Image next to it.

Runtimes can have the following statuses:

  • Available on DSXL, but not pushed to DSXHI. Users can push the image to DSXHI or refresh it there.
  • Pending transfer from DSXL to DSXHI.
  • Failed transfer from DSXL to DSXHI.
  • Available on DSXHI. DSX Local users can select the remote image as a worker, select a Target host, and submit jobs to it, which then run as YARN applications on the Hadoop cluster (see the sketch after this list).
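Job submission itself happens through the DSX Local Jobs interface, but for illustration, the run-as-a-YARN-application hand-off resembles a standard Apache Livy batch submission. The following is a minimal sketch against the Livy batch API, with a hypothetical endpoint URL and script path; it is not the exact call that DSX Local makes:

    # Illustrative only: submit a Python script as a YARN batch job through
    # the standard Apache Livy batch API. The URL and HDFS path are
    # hypothetical placeholders; DSX Local drives this kind of submission
    # from its Jobs UI rather than exposing it directly.
    import requests

    LIVY_URL = "https://dsxhi-edge.example.com:8443/gateway/dsx/livy2/v1"  # hypothetical
    HEADERS = {"Content-Type": "application/json"}

    batch = requests.post(LIVY_URL + "/batches",
                          json={"file": "hdfs:///user/dsxuser/jobs/train_model.py",
                                "args": ["--epochs", "10"]},
                          headers=HEADERS).json()

    # Poll the batch; Livy reports the state of the YARN application.
    state = requests.get(LIVY_URL + "/batches/%d" % batch["id"],
                         headers=HEADERS).json()["state"]
    print(state)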