Run a Spark application in DSX Local

An application developer can submit a Spark application to run asynchronously within the DSX Local cluster by using the REST APIs at https://9.87.654.321/api/v1/spark/ (where 9.87.654.321 represents the master node IP address).

Requirement: A bearer token is required to authenticate to your user home directory.
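
For example, after you retrieve a bearer token (see Retrieving a bearer token), you can store it in a shell variable so that the curl examples in this topic can reference it as $bearerToken:

# Store the bearer token in a shell variable for reuse in the curl examples.
# Replace the placeholder with the token that you retrieved.
bearerToken="<your bearer token>"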

Restriction: Spark applications that are written in Java or Scala are supported. Python Spark applications are supported (Beta) only if the application has no dependent files.

The sections in this topic describe the tasks that you can do.

Want to see the submission of external applications to the DSX Local Spark service in action? Watch this short video:

Figure 1. Submitting external applications to the Spark service in IBM DSX Local
This video walks you through the process of using the IBM DSX Local REST APIs to submit external applications to the DSX Local Spark service.

Upload the Spark application to your user home directory

You must upload the Spark application to the spark/apps/ directory (or a subdirectory within it) in your user home directory.

The following example uses the file management REST API to upload spark-examples_2.11-2.0.2.jar to its own jars/ directory in user-home/spark/apps/.

curl -k -H "authorization: Bearer $bearerToken" \
-H "Content-Type: multipart/form-data" -F "file=@spark-examples_2.11-2.0.2.jar" \
-X POST https://9.87.654.321/api/v1/filemgmt/user-home/spark/apps/jars/

Restriction: If the Spark application depends on other JAR files, the DSX administrator must upload these dependent files directly onto the master node, and then import the files into the system at the global level by running the /wdp/utils/importJarToClasspath.sh jarfile command, where jarfile represents the fully qualified path to the dependent JAR file.
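
For example, if the dependent JAR file was copied to /tmp/my-dependency.jar on the master node (a hypothetical path used here only for illustration), the administrator would run:

# Run on the master node as the DSX administrator.
# /tmp/my-dependency.jar is a hypothetical path to the dependent JAR file.
/wdp/utils/importJarToClasspath.sh /tmp/my-dependency.jar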

Run the Spark application

The following endpoint submits and starts the Spark application that you uploaded to DSX Local:

POST https://9.87.654.321/api/v1/spark/submit

Parameters:

appPath
Path to the Spark application, relative to the user-home/spark/apps/ directory.
className
Optional. The name of the main class to run.
appArgs
Optional. Arguments for the application, passed as-is as a single string.

The following example runs the application to compute Pi by using 1000 slices.

curl -i -k -X POST https://9.87.654.321/api/v1/spark/submit \
  -H "authorization: Bearer $bearerToken" \
  -H "content-type: application/json" \
  -d '{ "appPath": "jars/spark-examples_2.11-2.0.2.jar", "className": "org.apache.spark.examples.SparkPi", "appArgs": "1000"}'

The POST request returns a JSON response that indicates whether the submission was successful and provides a jobId for the application:

HTTP/1.1 202 Accepted
Server: openresty
Date: Wed, 28 Jun 2017 20:28:26 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 49
Connection: keep-alive
X-Powered-By: Express
ETag: W/"31-MaN3EcU0ECp4Qk06bbOCnIO02gY"

{"success":true,
"jobId":"1498681704348-1007-wu9"}

The value of the jobId field can be stored in a variable and used in later operations that target a specific job, for example:

jobId=`curl -k -X POST https://9.87.654.321/api/v1/spark/submit \
  -H "authorization: Bearer $bearerToken" \
  -H "content-type: application/json" \
  -d '{ "appPath": "jars/spark-examples_2.11-2.0.2.jar", "className": "org.apache.spark.examples.SparkPi", "appArgs": "1000"}' \
 | jq -r ".jobId"`

The following table describes what is returned from the spark/submit request for various scenarios:

Scenario | HTTP status code | JSON .success | JSON .jobId | JSON .error
Submission request was accepted | 202 | true | The ID of the job | n/a
Submission request not accepted because of a bad request, for example, when a mandatory argument such as appPath is not provided | 400 | false | n/a | Error message
Authentication error | 401 | false | n/a | "Authentication failed."

Important: Status code 202 only indicates that the submission was accepted and that a job was queued, not that the application ran successfully. To determine success, you must query the application status and application logs.

Providing a local file input to Spark applications

If you have a local file in /user-home/<uid>/spark/apps, you can use its absolute path in the appArgs field to provide the file as input to your Spark application. See Retrieving a bearer token to get the uid and the absolute path. The following example shows how to refer to /user-home/1003/spark/apps/mlib/text8_lines as a local input file for your Spark application:

curl -i -k -X POST https://9.87.654.321/api/v1/spark/submit \
  -H "authorization: Bearer $bearerToken" \
  -H "content-type: application/json" \
  -d '{ "appPath": "spark-examples_2.11-2.0.2.jar", "className": "org.apache.spark.examples.DFSReadWriteTest", "appArgs": "/user-home/1003/spark/apps/mlib/text8_lines /user-home/1003/spark/apps/mlib" }'

Check the status of the Spark application

The following endpoint returns the status of the Spark application that you ran on DSX Local:

GET https://9.87.654.321/api/v1/spark/status

Parameters:

jobId
The job ID that was returned when the Spark application was submitted.

Example:

curl -i -k -X GET -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/spark/status?jobId=$jobId

This GET request returns a JSON response that indicates the status of the application:

HTTP/1.1 200 OK
Server: openresty
Date: Sat, 24 Jun 2017 03:24:50 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 67
Connection: keep-alive
X-Powered-By: Express
ETag: W/"43-CnLkCzCDWMduQDeFf0G8ittVjlM"

{"success":true,
 "status":"Completed"}

The following table describes what is returned from the spark/status request for various scenarios:

Scenario | HTTP status code | JSON .success | JSON .status | JSON .error
Regular request with the job status returned in the response | 200 | true | The status of the job | n/a
Request with missing jobId | 400 | false | n/a | "Query field "jobId" is not provided."
Request with invalid jobId | 400 | false | n/a | "No resources found."
Authentication error | 401 | false | n/a | Error message

Job statuses:

Completed
The job that runs the application is completed. To determine whether the application ran successfully, you must query the application status and application logs.
Failed
The service failed to run the job.
Running
The job is still running.
Terminating
The job is being terminated because of a cancellation request.
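
As an illustration only, the following shell sketch polls the spark/status endpoint until the job leaves the Running state. It assumes that the jq command is available and that the $bearerToken and $jobId variables are set as in the earlier examples:

# Poll the job status every 10 seconds until it is no longer "Running".
# Assumes $bearerToken and $jobId are set as shown in the earlier examples.
while true; do
  status=$(curl -s -k -H "authorization: Bearer $bearerToken" \
    "https://9.87.654.321/api/v1/spark/status?jobId=$jobId" | jq -r ".status")
  echo "Job $jobId status: $status"
  [ "$status" != "Running" ] && break
  sleep 10
done

Remember that a Completed status does not guarantee that the application succeeded; check the driver logs as described in the next section.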

Retrieve Spark application output

The output and error logs from a submitted Spark application are written to dedicated files on the DSX Local cluster, organized by the jobId of the submitted application. You can download these files from your user home directory by using the file management REST API.

The following example lists the log files in your user home directory:

curl -k -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/filemgmt/user-home/spark/logs/

The following example response shows that the log directory contains log files for three jobs:

[ "1498681704348-1007-wu9/", "1498714662603-1018-6t7/", "1498714745802-1018-x5s/" ]

The following example lists the contents of the directory for a particular jobId:

curl -k -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/filemgmt/user-home/spark/logs/$jobId

The following example response shows four log files:

["driver.stderr","driver.stdout","submission.stderr","submission.stdout"]

  • driver.stderr contains the standard error of the application
  • driver.stdout contains the standard output of the application
  • submission.stderr contains the standard error of the job that submits the application
  • submission.stdout contains the standard output of the job that submits the application

The following example retrieves the content of driver.stdout from the cluster, redirects it to a file on the local file system, and then displays the contents of the file:

curl -k -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/filemgmt/user-home/spark/logs/$jobId/driver.stdout > result.out
cat result.out
Pi is roughly 3.141768911417689

Cancel a Spark application job

The following endpoint cancels a Spark application job:

POST https://9.87.654.321/api/v1/spark/cancel

Parameters:

jobId
The job ID that was returned when the Spark application was submitted.

Example:

curl -i -k -X POST -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/spark/cancel?jobId=$jobId

This POST request returns a JSON response that indicates whether the cancellation succeeded:

HTTP/1.1 200 OK
Server: openresty
Date: Sat, 24 Jun 2017 03:25:08 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 98
Connection: keep-alive
X-Powered-By: Express
ETag: W/"62-knxtda+nYcb/HBlin77E8UekR5E"

{"success":true,
 "jobId":"1498685899328-1004-su1"}

The following table describes what is returned from the spark/cancel request for various scenarios:

Scenario | HTTP status code | JSON .success | JSON .jobId | JSON .error
Successfully canceled a job | 200 | true | The ID of the job that was canceled | n/a
Request with missing jobId | 400 | false | n/a | "Query field "jobId" is not provided."
Request with invalid jobId | 400 | false | n/a | "Error in killing the pod."
Authentication error | 401 | false | n/a | Error message

Review the Spark submission

When a Spark application is submitted to DSX Local, it is only queued up to be processed asynchronously. A successful response from the spark/submit endpoint does not mean that the application ran successfully, as the application might fail at a later time.

You can query the driver.stderr file for the application job to determine whether any errors occurred during the submission. The following example shows the submission of a Spark application that eventually fails and logs an error to driver.stderr because the application file name is incorrect.

curl -i -k -X POST https://9.87.654.321/api/v1/spark/submit \
  -H "authorization: Bearer $bearerToken" \
  -H "content-type: application/json" \
  -d '{ "appPath": "jars/wrongFileName.jar", "className": "org.apache.spark.examples.SparkPi", "appArgs": "10000"}' 
HTTP/1.1 202 Accepted
Server: openresty
Date: Mon, 26 Jun 2017 01:26:03 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 83
Connection: keep-alive
X-Powered-By: Express
ETag: W/"53-YlhrAQfP4LqR1/sPzkpY9f6Nir8"
{"success":true,"jobId":"1498715368617-1018-zfh"}

jobId=1498715368617-1018-zfh

curl -i -k -X GET -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/spark/status?jobId=$jobId
HTTP/1.1 200 OK
Server: openresty
Date: Mon, 26 Jun 2017 01:26:27 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 69
Connection: keep-alive
X-Powered-By: Express
ETag: W/"45-WJ4NDXRIgV72BRkqRHL0i+7QS48"
{"success":true,"status":"Completed"}

curl -k -H "authorization: Bearer $bearerToken" \
  https://9.87.654.321/api/v1/filemgmt/user-home/spark/logs/$jobId/driver.stderr 
Warning: Local jar /user-home/1010/spark/apps/jars/wrongFileName.jar does not exist, skipping.
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:693)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
