
Connect remotely to an external Spark

By using HTTP REST APIs from the Apache Livy service, you can connect remotely to an external Spark cluster from your Watson Studio Local notebook.

Restrictions:
  • You must install Apache Livy on the remote Spark site. Livy provides a URL to interact with the remote Spark.
  • You must install all the dependencies (any libraries and packages) for your Spark code on the remote Spark site.

See Apache Livy Examples for more details on how a Python, Scala, or R notebook can connect to the remote Spark site.

Tasks you can perform:

To connect to the remote Spark site, create a Livy session (in either UI mode or command mode) by using the REST API endpoint. The endpoint must include the Livy URL, port number, and authentication type. Sparkmagic example:

%spark add -s session1 -l python -u https://my.hdp.system.com:8443/gateway/default/livy/v1 -a u -k

where session1 represents the session name and python represents the notebook language.

You can enter %spark? or %%help for the list of commands.

If you want subsequent lines in your notebook to use the Watson Studio Local Spark service, specify %%local. Otherwise, the cell defaults to the remote Spark service. If you want data to be returned to the local notebook, use %%spark -o dd, where dd is the name of the local variable that receives the result.
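
For example, a minimal sketch, assuming an existing session named session1 and an illustrative Hive table name (run each magic in its own cell):

%%spark -s session1 -c sql -o dd
select * from my_database.my_table limit 10

%%local
# dd is now a pandas DataFrame in the local Watson Studio Local kernel
dd.head()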

Afterward, ensure that you delete the Livy session to release the Spark resource:

%spark delete session1

Set the default Livy URL for Watson Studio Local

A Watson Studio Local administrator can configure a default Livy endpoint URL for Watson Studio Local by completing the following steps:

  1. In a command shell, sign in to the Watson Studio Local master node as root.
  2. Set the default Livy endpoint by running the /wdp/utils/add_endpoint.sh script. Example for an unsecured Livy endpoint:
    add_endpoint.sh --livy-url=http://my.hdp.system.com:8998

    Example for a secure Livy endpoint that requires an SSL certificate:

    add_endpoint.sh --addcert --livy-url=https://my.hdp.system.com:8443/gateway/default/livy/v1

As a result, the script automatically creates a default_endpoints.conf file in the /user-home/_global_/config directory. The file contains a line similar to the following:

livy_endpoint=https://9.87.654.321:8443/gateway/default/livy/v1

Alternatively, in the Admin Console, go to Scripts, and from the Scripts pull-down menu select Add the default Livy endpoint for Watson Studio (add_endpoint.sh) to perform the same task.
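
If you need the configured default endpoint inside a notebook (for example, to pass it to %spark add yourself), the following is a minimal sketch for reading it from the file above; the parse_default_livy_endpoint helper is illustrative and is not part of dsx_core_utils:

# Illustrative helper: read livy_endpoint from the default endpoints file
def parse_default_livy_endpoint(path="/user-home/_global_/config/default_endpoints.conf"):
    with open(path) as conf:
        for line in conf:
            key, _, value = line.strip().partition("=")
            if key == "livy_endpoint":
                return value
    return None

print(parse_default_livy_endpoint())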


Create a Livy session on a secure HDP cluster using JWT authentication

All Livy sessions that are initiated from Watson Studio Local to a secure HDP cluster authenticate by using the JWT token of the logged-in Watson Studio Local user. The proxy user for these sessions is the logged-in Watson Studio Local user, so the HDP administrator can monitor the jobs that each Watson Studio Local user submits.

Restriction: The HDP cluster must be set up to work with Watson Studio Local, and Watson Studio Local must be configured to work with the HDP cluster.

Zeppelin example

When Zeppelin starts, all Livy interpreters are set up with the JWT token belonging to the user. Additionally, all Livy interpreters that do not have the zeppelin.livy.url property set, or that have the default value, are updated to point to the Livy endpoint that is set up in /user-home/_global_/config/default_endpoints.conf.

To update interpreters that are created after Zeppelin starts, run:

%python
import dsx_core_utils
dsx_core_utils.setup_livy_zeppelin()

Jupyter Python example

Set up:

%load_ext sparkmagic.magics
import dsx_core_utils
dsx_core_utils.setup_livy_sparkmagic()
%reload_ext sparkmagic.magics

Create a session from the Sparkmagic client:

%manage_spark
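
The %manage_spark widget is interactive. If you prefer to create the session from code, the following is a minimal sketch that uses the %spark line magic with the example endpoint from earlier in this topic (run each magic in its own cell):

%spark add -s session1 -l python -u https://my.hdp.system.com:8443/gateway/default/livy/v1 -k

%%spark -s session1
# This cell runs on the remote Spark cluster; spark is the remote SparkSession
spark.version

%spark delete session1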

Jupyter R and RStudio example

To create a Spark context based on the default Livy endpoint:

library("dsxCoreUtilsR")
sc <- dsxCoreUtilsR::createLivySparkContext()

To create a Spark context based on a specific Livy endpoint:

library("dsxCoreUtilsR")
sc <- dsxCoreUtilsR::createLivySparkContext(livy_endpoint="https://9.87.654.323:8443/gateway/dsx/livy/v1")

To disconnect the Spark context:

dsxCoreUtilsR::disconnectLivySparkContext(sc)

Query Hive data set on an HDP cluster

To quickly access a small data set, you can use a WebHCat-based Hive query (if your data set is large, use HDP Spark through Livy instead):

  1. Access the authentication token:
    import os
    dsx_token  = os.environ['DSX_TOKEN']
  2. Issue the Hive query to the HDP cluster configured to work with Watson Studio Local:
    %%bash -s "$dsx_token"
    curl -s -k -H "X-Requested-By: dsxuser" -H "Authorization: $1" \
    -d execute='select+*+from+hiveris.drivers+limit+10;' \
    -d statusdir='dans2' \
    'https://9.87.654.321:8443/gateway/dsx/templeton/v1/hive?user.name=user1'

    This WebHCat request returns a job ID, for example, {"id":"job_1512769862398_0157"}. In the cell, replace the host name in the URL with the host name of your HDP cluster's WebHCat (Templeton) server. The query might take some time to run. Check the status of the query on the YARN cluster metrics dashboard, for example, http://9.87.654.321:8088/cluster, or by polling the WebHCat jobs resource as shown in the sketch after these steps. When the job completes, you can retrieve the result. If the job does not complete, check the logs on the YARN cluster metrics dashboard and contact the HDP administrator to resolve the issue.

  3. Retrieve results of the query stored on HDFS:
    %%bash -s "$dsx_token"
    curl -i -k -L -H "X-Requested-By: dsxuser" -H "Authorization: $1" \
    -H "Content-Type: application/json" \
    "https://<hadoophdfshostname>:8443/gateway/dsx/webhdfs/v1/user/user1/dans2/stdout?op=OPEN"
    # Change stdout to stderr in the path to see errors
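
Instead of watching the YARN dashboard, you can also poll the WebHCat jobs resource for the returned job ID. The following is a minimal Python sketch, assuming the requests library is available in the notebook and the gateway exposes the standard WebHCat jobs API; the job ID is the example one above, and the exact response fields can vary by HDP version:

import os
import time
import requests

job_id = "job_1512769862398_0157"   # example job ID returned by the Hive query above
url = "https://9.87.654.321:8443/gateway/dsx/templeton/v1/jobs/" + job_id
headers = {"X-Requested-By": "dsxuser", "Authorization": os.environ["DSX_TOKEN"]}

# Poll until the job leaves the PREP/RUNNING states, the same check as the YARN dashboard
while True:
    resp = requests.get(url, params={"user.name": "user1"}, headers=headers, verify=False)
    state = resp.json().get("status", {}).get("state")
    print(state)
    if state not in ("PREP", "RUNNING"):
        break
    time.sleep(30)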

Run jobs on a remote Spark cluster using Livy

  1. Initialize Sparkmagic:
    %load_ext sparkmagic.magics
    from dsx_core_utils import proxy_util
    proxy_util.configure_proxy_livy()
  2. Start a remote Spark Livy session:
    %spark add -s session1 -l python -u https://9.87.654.321:8443/gateway/dsx/livy/v1 -k
  3. Query Hive (the -o flag copies the result into the local notebook as a pandas DataFrame named local_drivers; see the sketch after these steps):
    %%spark -s session1 -c sql -o local_drivers 
    select * from tpch_text_2.customer limit 50
  4. Submit the Spark job:
    %%spark -s session1 
    import random
    NUM_SAMPLES = 100000
    
    def sample(p):
        x, y = random.random(), random.random()
        return 1 if x*x + y*y < 1 else 0
    
    count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

Connect to a remote Spark in an HDP cluster using Alluxio

The following Jupyter notebook code example creates the Spark driver, runs it inside the notebook on Watson Studio Local, and connects to the remote Hadoop cluster:

import dsx_core_utils

# Show the registered Hadoop integration systems and list the available driver jars
dsx_core_utils.get_dsxhi_info(True)
!ls /dbdrivers

# Upload the Alluxio client jar to HDFS so the remote Spark session can load it
dsx_core_utils.upload_hdfs_file(source_path="/dbdrivers/alluxio-1.8.0-client.jar",
   target_path="/user/user1/alluxio-1.8.0-client.jar",
   webhdfsurl="https://9.87.654.321:8443/gateway/9.87.654.231/webhdfs/v1")

# Additional Livy session configuration, including the uploaded jar
myConfig={"queue": "default",
   "driverMemory":"2G",
   "numExecutors":2,
   "jars":["/user/user1/alluxio-1.8.0-client.jar"]
   }

# Point Sparkmagic at the registered Livy endpoint with the extra configuration
dsx_core_utils.setup_livy_sparkmagic(
    system="hdfssystem",
    livy="livyspark2",
    addlConfig=myConfig)
%reload_ext sparkmagic.magics

# Remove any existing sessions, then create and inspect a new remote session
%spark cleanup
%spark add -s session01 -l python -u https://asgardian-edge.fyre.ibm.com:8443/gateway/9.87.654.231/livy2/v1 -k
%spark info

# Run cells on the remote Spark session (each %%spark cell runs on the cluster)
%%spark
spark

%%spark
data=spark.read.csv("alluxio://soup1.fyre.ibm.com:19998/tennis.csv").show()

# Delete the remote session when you are done
%spark cleanup

Learn more

See the sample notebooks for examples on how to connect to a remote Spark from your Watson Studio Local notebook by using Sparkmagic.