Hadoop integration

The Watson Studio Local Hadoop Integration Service is a registration service that can be installed on a Hadoop edge node. It allows Watson Studio Local Version 1.2 or later clusters to securely access data that resides on the Hadoop cluster, submit interactive Spark jobs, build models, and schedule jobs that run as YARN applications on the Hadoop cluster.

Important: Your Hadoop admin must install the Hadoop registration service on either an HDP or a CDH cluster, add the Watson Studio Local cluster to the Hadoop registration so that it can be managed, and provide the Hadoop registration service user ID and the URL, specific to the Watson Studio Local cluster, that is used to connect securely to the Hadoop registration service.

A Watson Studio Local administrator can then perform the following tasks in the Admin Console:

  1. Register a Hadoop cluster
  2. View details about the registered Hadoop cluster
  3. Push runtime images to the registered Hadoop cluster
  4. Work with images on the Hadoop cluster

Register a Hadoop cluster

In the Admin Console, click the menu icon and then click Hadoop Integration to register your Hadoop clusters and create images for virtual runtime environments on them. When a Hadoop registration service is registered in Watson Studio Local, HDFS and Hive data sources are created automatically, depending on the configuration of the registered Hadoop system. Watson Studio Local users can then list the available registered Hadoop Livy endpoints for use with Jupyter, RStudio, and Zeppelin, work with the data sources and Livy, and select registered Hadoop images as workers to which Python jobs can be remotely submitted.

To register the endpoint of a new Hadoop cluster installation, click Add Registration. Name the registration and provide the authentication information.

Troubleshooting tip: If the registration fails, view the kubectl logs of the utils-api pod.

View details about the registered Hadoop cluster

In the Details page of each registration, you can view its endpoints and runtimes.

Depending on which services were exposed when the Hadoop registration service was installed and configured on the Hadoop edge node, the details page lists the WebHDFS, WebHCAT, Livy for Spark, and Livy for Spark 2 URLs exposed by the Hadoop registration.

  • If the WebHDFS and/or WebHCAT services are exposed in conjunction with the Livy service, Watson Studio Local users can work with the data sources associated with those services without having to create them explicitly (see the WebHDFS sketch after this list).
  • If the Livy for Spark and/or Livy for Spark 2 services are exposed, Watson Studio Local users can list these endpoints through dsx_core_utils and dsxCoreUtilsR, and use them as defined Livy endpoints in Jupyter, RStudio, and Zeppelin notebooks. Python syntax:
    %python
    import dsx_core_utils
    dsx_core_utils.list_dsxhi_livy_endpoints()
    R syntax:
    library('dsxCoreUtilsR')
    dsxCoreUtilsR::listDSXHILivyEndpoints()
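
For illustration, the following minimal sketch shows what the exposed WebHDFS endpoint provides, using the Python requests library to list an HDFS directory. The URL, path, and credentials are placeholders, and within Watson Studio Local the automatically created HDFS data source normally handles this access; the actual URL and authentication scheme depend on how the Hadoop registration service was configured.

import requests

# Placeholder values: copy the real WebHDFS URL from the registration details page.
webhdfs_url = "https://dsxhi-edge.example.com:8443/gateway/dsx/webhdfs/v1"
hdfs_path = "/user/dsxuser/data"

# List the directory through the exposed WebHDFS REST endpoint (op=LISTSTATUS).
# Basic authentication and disabled certificate verification are placeholders;
# the configured gateway might require a different scheme.
response = requests.get(
    webhdfs_url + hdfs_path,
    params={"op": "LISTSTATUS"},
    auth=("dsxhi_user", "password"),
    verify=False)
response.raise_for_status()
for status in response.json()["FileStatuses"]["FileStatus"]:
    print(status["pathSuffix"], status["type"])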

If a registered endpoint changes, you can refresh the registration.

Push runtime images to the registered Hadoop cluster

Pushing runtime images to the registered Hadoop cluster lets users work with the same Python packages that are available in Watson Studio Local when their jobs run on the Hadoop cluster.

Requirement: The Watson Studio Local cluster and the Hadoop cluster must be on the same platform architecture to work with the runtime images on Hadoop.

The Watson Studio Local administrator can view the default images and the custom images created by Watson Studio Local users, and can push an image to the Hadoop cluster or replace it there. To push a runtime image to the registered Hadoop cluster from its details page, click Push next to the image. Note that pushing an image can take a long time. If you modify a runtime image locally, you can update it on the remote cluster by clicking Replace Image next to it.

Runtimes can have the following statuses:

  • Available on Watson Studio Local, but not pushed to the registered Hadoop cluster. Users can either push or refresh the environment to the registered Hadoop cluster.
  • Pending transfer from Watson Studio Local to the registered Hadoop cluster.
  • Failed transfer from Watson Studio Local to the registered Hadoop cluster.
  • Available on the registered Hadoop cluster. Watson Studio Local users can select the remote image as a worker, select a Target host, and submit jobs to it.
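
The same information is available from a notebook through dsx_core_utils. The following minimal sketch assumes that the summary printed by get_dsxhi_info also lists the runtime images pushed to each registered system, so a user can confirm that an image is available before submitting jobs to it.

import dsx_core_utils

# Print a summary of the registered Hadoop systems; assuming the summary lists
# the pushed runtime images, check that the image you need shows as available.
dsx_core_utils.get_dsxhi_info(showSummary=True)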

Work with images on the Hadoop cluster

Watson Studio Local users can work with the pushed images in the notebooks and jobs environment. Example scenario:

  1. The Watson Studio Local user configures their worker page to select the custom image for the worker.
  2. The Watson Studio Local user creates a script run job using the customized worker, and selects the registered Hadoop system as the target host.
  3. The Watson Studio Local user runs the job, selecting the registered Hadoop system as the target host.
Tip: When a Watson Studio Local administrator pushes a custom image to a registered Hadoop system, a notebook can initialize a remote Spark Livy session that runs within an environment associated with that remote custom image:
import dsx_core_utils

# Print a summary of the registered Hadoop systems and their endpoints.
dsx_core_utils.get_dsxhi_info(showSummary=True)

# Additional Livy session properties, for example the number of executors.
myConfig = {
   "numExecutors": 3,
}

# Configure sparkmagic for the registered system, Livy endpoint, and pushed image,
# then reload sparkmagic and create the remote Livy session.
dsx_core_utils.setup_livy_sparkmagic(system='edge', livy='livyspark2',
    imageId='arrow-730-dsx-scripted-ml-python2', addlConfig=myConfig)
%reload_ext sparkmagic.magics
%spark add -s session01 -l python -u https://9.87.654.321:8443/gateway/9.87.654.322/livy2/v1 -k

%%spark
...data analysis...

# Delete the remote Livy session when finished to release its YARN resources.
%spark delete -s session01 -u https://9.87.654.321:8443/gateway/9.87.654.322/livy2/v1 -k