
Connect remotely to an external Spark

By using the HTTP REST APIs of the Apache Livy service, you can connect remotely to an external Spark service from your Data Science Experience notebook.

Requirements:

  • You must install Apache Livy on the remote Spark site. Livy provides a URL to interact with the remote Spark.
  • You must install all the dependencies (any libraries and packages) for your Spark code on the remote Spark site.

See Apache Livy Examples for more details on how a Python, Scala, or R notebook can connect to the remote Spark site.
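
For reference, Livy itself exposes a plain HTTP interface, and Sparkmagic issues the equivalent calls on your behalf. The following is a minimal sketch of the raw REST flow with the Python requests library; the endpoint URL and the user1/password credentials are placeholders, the exact authentication depends on your gateway, and verify=False corresponds to the -k flag used in the curl examples in this topic:

import json
import time
import requests

# Placeholder values; substitute your own Livy endpoint and credentials
livy_url = "https://my.hdp.system.com:8443/gateway/default/livy/v1"
auth = ("user1", "password")
headers = {"Content-Type": "application/json", "X-Requested-By": "dsxuser"}

# Create a PySpark session (POST /sessions)
resp = requests.post(livy_url + "/sessions", headers=headers, auth=auth,
                     data=json.dumps({"kind": "pyspark"}), verify=False)
session_url = "%s/sessions/%s" % (livy_url, resp.json()["id"])

# Wait for the session to become idle, then run a statement (POST /sessions/{id}/statements)
while requests.get(session_url, headers=headers, auth=auth, verify=False).json()["state"] != "idle":
    time.sleep(5)
stmt = requests.post(session_url + "/statements", headers=headers, auth=auth,
                     data=json.dumps({"code": "print(sc.version)"}), verify=False)
print(stmt.json())

# Delete the session to release the Spark resources (DELETE /sessions/{id})
requests.delete(session_url, headers=headers, auth=auth, verify=False)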

Want to see connecting to an external Spark service in action? Watch this short video:

Figure 1. Video: Connecting to an external Spark service from IBM DSX Local
This video walks you through the process of writing notebooks in IBM DSX Local that remotely connect to an external Spark service with Livy using Sparkmagic.

To connect to the remote Spark site, create the Livy session (in either UI mode or command mode) by using the REST API endpoint. The endpoint must include the Livy URL, port number, and authentication type. Sparkmagic example:

  %spark add -s session1 -l python -u https://my.hdp.system.com:8443/gateway/default/livy/v1 -a u -k

where session1 represents the session name and python represents the notebook language.

You can enter %spark? or %%help for the list of commands.

If you want a cell in your notebook to use the local DSX Spark service, specify %%local at the beginning of the cell. Otherwise, the cell defaults to the remote Spark service. If you want data to be returned to the local notebook, use %%spark -o dd, where dd is the name of the variable that receives the data locally.
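
For example, a minimal sketch of two notebook cells that combine these magics, assuming the session1 session from the example above and a placeholder table name (with -c sql, the -o option makes the query result available locally as a pandas DataFrame):

%%spark -s session1 -c sql -o dd
select * from my_database.my_table limit 10

%%local
dd.head()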

Afterward, ensure that you delete the Livy session to release the Spark resource:

%spark delete session1

Set the default Livy URL for DSX Local

A DSX administrator can configure a default Livy endpoint URL for DSX Local by completing the following steps:

  1. In a command shell, sign in to the DSX Local master node as root.

  2. Set the default Livy endpoint by running the /wdp/utils/add_endpoint.sh script. Example for an unsecured livy endpoint:

    add_endpoint.sh --livy-url=http://my.hdp.system.com:8998
    

    Example for a secure Livy endpoint that requires an SSL certificate:

    add_endpoint.sh --addcert --livy-url=https://my.hdp.system.com:8443/gateway/default/livy/v1
    

As a result, the script creates a default_endpoints.conf file in the /user-home/_global_/config directory. The file contains a line such as:

livy_endpoint=https://9.87.654.321:8443/gateway/default/livy/v1

Alternatively, in the Admin Console, go to Scripts and in the Script pull-down menu, select Set the default Livy endpoint for DSX Local (add_endpoint.sh) to perform the same tasks.

Create a Livy session on a secure HDP cluster using JWT authentication

All Livy sessions that are initiated from DSX to a secure HDP cluster authenticate by using the JWT token of the logged-in DSX user. The proxy user for these sessions is the logged-in DSX user, so the HDP administrator can monitor the jobs that each DSX user submits.

Requirement: The HDP cluster must be set up to work with DSX Local, and DSX Local must be configured to work with the HDP cluster.
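
To illustrate what this looks like at the REST level, the following is a minimal sketch that passes the DSX token of the logged-in user (available in the notebook as the DSX_TOKEN environment variable, as in the Hive example later in this topic) as the Authorization header of a session-creation call; the gateway URL is a placeholder. In practice, Zeppelin and Sparkmagic set this up for you, as described in the examples that follow.

import json
import os
import requests

# The DSX token of the logged-in user is accepted by the secure gateway as the
# Authorization header value (placeholder endpoint)
headers = {"Content-Type": "application/json",
           "X-Requested-By": "dsxuser",
           "Authorization": os.environ["DSX_TOKEN"]}
resp = requests.post("https://9.87.654.321:8443/gateway/dsx/livy/v1/sessions",
                     data=json.dumps({"kind": "pyspark"}),
                     headers=headers, verify=False)
print(resp.json())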

Zeppelin example

When Zeppelin starts up, all Livy interpreters are set up with the JWT token that belongs to the user. Additionally, all Livy interpreters that do not have the zeppelin.livy.url property set, or that have the default value, are updated to point to the Livy endpoint that is set in /user-home/_global_/config/default_endpoints.conf.

To update interpreters that are created after Zeppelin starts up, run:

%python
import dsx_core_utils
dsx_core_utils.setup_livy_zeppelin()

Jupyter Python example

Set up:

%load_ext sparkmagic.magics
import dsx_core_utils
dsx_core_utils.setup_livy_sparkmagic()
%reload_ext sparkmagic.magics

Create a session from the Sparkmagic client:

%manage_spark

Jupyter R and RStudio example

To create a Spark context based on the default Livy endpoint:

library("dsxCoreUtilsR")
sc <- dsxCoreUtilsR::createLivySparkContext()

To create a Spark context based on a specific Livy endpoint:

library("dsxCoreUtilsR")
sc <- dsxCoreUtilsR::createLivySparkContext(livy_endpoint="https://9.87.654.323:8443/gateway/dsx/livy/v1")

To disconnect the Spark context:

dsxCoreUtilsR::disconnectLivySparkContext(sc)

Query Hive data set on an HDP cluster

To quickly access a small data set, you can use a WebHCat-based Hive query (if your data set is large, use HDP Spark through Livy instead):

  1. Access the authentication token:

    import os
    dsx_token  = os.environ['DSX_TOKEN']
    
  2. Issue the Hive query to the HDP cluster configured to work with DSX Local:

    %%bash -s "$dsx_token"
    curl -s -H "X-Requested-By: dsxuser" -H "Authorization: $1" -d execute='select+*+from+hiveris.drivers+limit+10;' \
    -d statusdir='dans2' \
    'https://9.87.654.321:8443/gateway/dsx/templeton/v1/hive?user.name=user1' -k
    

    This WebHCat request returns a job ID, for example, {"id":"job_1512769862398_0157"}. In the cell, replace 9.87.654.321 with the hostname of the Hive (Templeton) server on your HDP cluster. The query might take some time to run. Check the status of the query on the YARN cluster metrics dashboard, for example, http://9.87.654.321:8088/cluster. When the job completes, you can check the result. If the job does not complete, check the logs on the YARN cluster metrics dashboard and contact the HDP administrator to resolve any issues. A Python alternative to these curl commands is sketched after these steps.

  3. Retrieve results of the query stored on HDFS:

    %%bash -s "$dsx_token"
    curl -i  -H "X-Requested-By: dsxuser" -H "Authorization: $1" -H "Content-Type: application/json" -L "https://<hadoophdfshostname>:8443/gateway/dsx/webhdfs/v1/user/user1/dans2/stdout?op=OPEN" -k
    #Change to stderr to see errors
    
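The same submission and retrieval can be issued from a Python cell with the requests library. This is a minimal sketch that mirrors the curl commands above; the hostnames, user name, query, and status directory are the placeholders from those commands:

import os
import requests

dsx_token = os.environ["DSX_TOKEN"]
headers = {"X-Requested-By": "dsxuser", "Authorization": dsx_token}

# Submit the Hive query through WebHCat (mirrors the first curl command)
templeton = "https://9.87.654.321:8443/gateway/dsx/templeton/v1"
resp = requests.post(templeton + "/hive",
                     params={"user.name": "user1"},
                     data={"execute": "select * from hiveris.drivers limit 10;",
                           "statusdir": "dans2"},
                     headers=headers, verify=False)
print(resp.json())   # for example, {"id": "job_1512769862398_0157"}

# After the job completes, read the stored results from HDFS through WebHDFS
webhdfs = "https://<hadoophdfshostname>:8443/gateway/dsx/webhdfs/v1"
out = requests.get(webhdfs + "/user/user1/dans2/stdout",
                   params={"op": "OPEN"},
                   headers=headers, verify=False)
print(out.text)      # change stdout to stderr to see errors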

Run jobs on a remote Spark cluster using Livy

  1. Initialize Sparkmagic:

    %load_ext sparkmagic.magics
    from dsx_core_utils import proxy_util
    proxy_util.configure_proxy_livy()
    
  2. Start a remote Spark Livy session (see the sketch after these steps for reading the default Livy endpoint from configuration instead of hardcoding the URL):

    %spark add -s session1 -l python -u https://9.87.654.321:8443/gateway/dsx/livy/v1 -k
    
  3. Query Hive:

    %%spark -s session1 -c sql -o local_drivers 
    select * from tpch_text_2.customer limit 50
    
  4. Submit the Spark job:

    %%spark -s session1
    import random
    NUM_SAMPLES = 100000

    def sample(p):
        # Count a point if it falls inside the unit quarter circle
        x, y = random.random(), random.random()
        return 1 if x*x + y*y < 1 else 0

    count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
    
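As noted in step 2, the Livy URL is hardcoded in this example. If your administrator has set a default Livy endpoint (see Set the default Livy URL for DSX Local earlier in this topic), the following minimal sketch reads it from the generated configuration file so that you can pass it to %spark add instead:

# Read the default Livy endpoint written by add_endpoint.sh
conf_path = "/user-home/_global_/config/default_endpoints.conf"
livy_endpoint = None
with open(conf_path) as conf:
    for line in conf:
        if line.startswith("livy_endpoint="):
            livy_endpoint = line.strip().split("=", 1)[1]
print(livy_endpoint)   # pass this value to %spark add with the -u option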

Learn more

See the sample notebooks for examples on how to connect to a remote Spark from your DSX notebook by using Sparkmagic.