Run jobs in the background

A project editor or administrator can schedule a Python job that runs a source file, such as a notebook or script, asynchronously in the background, either on demand or at regular intervals. Jobs can also be used to score and evaluate models.


Want to see job scheduling in action? Watch this short video:

Figure 1. Video: Job Scheduling in Data Science Experience Local
This video shows how you can schedule a job in DSX Local, either on demand or at regular intervals, to run source files such as notebooks or scripts.

To create the job, go to your project home, click Jobs, and click Create job. Specify a name and description. A job name can contain uppercase and lowercase alphanumeric characters, spaces, and dashes. Other special characters are not allowed.
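The naming rule above can be checked before you type a name into the form. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits and that no length limit applies; the function name `is_valid_job_name` is hypothetical and is not part of DSX Local.

```python
import re

# Assumed pattern: one or more ASCII letters, digits, spaces, or dashes.
# The product may enforce additional rules (for example, a length limit).
JOB_NAME_PATTERN = re.compile(r"[A-Za-z0-9 -]+")


def is_valid_job_name(name):
    """Return True if name uses only the characters the text allows."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_job_name("Nightly scoring-2"))  # only allowed characters
print(is_valid_job_name("score_v2!"))          # contains '_' and '!'
```

The first call prints True and the second prints False, because the underscore and exclamation mark fall outside the allowed set.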

For Type, you can select one of the following options:

  • Batch scoring (requires you to select a .ipynb notebook or .py script from your project assets)
  • Model evaluation (requires you to select a .ipynb notebook or .py script from your project assets)
  • Notebook run (requires you to select a notebook from your project assets)
  • Script run (requires you to select a .py script from your project assets)
  • SPSS Stream (requires you to select a stream from your project assets)

Workers are Docker environments that identify the specific runtime needed for the job to run. When a job run is triggered, a container (as defined by the worker) is brought up and used as the virtual environment to run the job. If two jobs run at the same time, each gets its own worker container.

If the DSX administrator registered an HDP cluster, and the worker you selected is available on it, then you can select that cluster as the Target host to run the job on. You can also click the settings icon next to the target host to edit the values.

Schedule how often to run the job (based on UTC time) and click the Create button. From either the job list or the job details, you can click Run now to start the job immediately, or reschedule when the job runs. In the job details, you can click each run to view its result, logs, duration, environment variables, and command line arguments.
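Because the schedule is interpreted in UTC, it can help to convert the local time you have in mind before filling in the form. The sketch below assumes a fixed UTC offset for your location and ignores daylight saving changes; `local_to_utc` is a hypothetical helper, not a DSX Local function.

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(year, month, day, hour, minute, utc_offset_hours):
    """Convert a local wall-clock time at a fixed UTC offset to UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_time = datetime(year, month, day, hour, minute, tzinfo=local_zone)
    return local_time.astimezone(timezone.utc)


# 02:30 local time at UTC-5 corresponds to 07:30 UTC on the same day.
print(local_to_utc(2018, 3, 1, 2, 30, -5).strftime("%Y-%m-%d %H:%M"))
```

Note that a late-evening local time can fall on the next calendar day in UTC, which matters for daily or weekly schedules.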

As each run completes, you can view the run history in the job details page. If a run does not finish before the next scheduled interval, the next scheduled run waits until the current run completes. In the run details page, you can view the logs. You can also manually rerun the job between its scheduled intervals. If you happen to run the job manually at the same time as a scheduled interval, both runs proceed concurrently.

Restriction: Because different collaborators might have different versions of the project, you might not be able to run someone else's job successfully.

To stop a job run, click Stop next to it. The job will exit immediately with a Failure status, freeing any reserved resources. DSX Local will preserve the history of the run, including logs.

Script run tip: In the Environment variables field, you can specify which Spark version to use, for example, SPARK_VERSION=2.2, SPARK_VERSION=2.1, or SPARK_VERSION=2.0. Restriction: If you select a GPU worker for the job, you cannot specify a Spark version.

If you change the SPARK_VERSION environment variable, ensure you initialize the Spark context in your script:

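import pyspark

# Spark master URL for the selected Spark version (this value is for Spark 2.2.1)
masterUrl = "spark://spark-master221-svc:7077"
conf = pyspark.SparkConf().setMaster(masterUrl)
sc = pyspark.SparkContext(conf=conf)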

where masterUrl represents the URL to create the SparkContext, for example, spark://spark-master-svc:7077 for Spark 2.0.2 and spark://spark-master221-svc:7077 for Spark 2.2.1.
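If the same script is run with different SPARK_VERSION settings, the master URL can be chosen at run time. The following is a minimal sketch that uses only the two URLs quoted above. It assumes that SPARK_VERSION=2.0 corresponds to Spark 2.0.2 and SPARK_VERSION=2.2 to Spark 2.2.1, and it assumes that SPARK_VERSION is visible to the script as an environment variable. The helper `master_url_for` is hypothetical.

```python
import os

# Master URLs quoted in the text; other versions are left out on purpose
# because their URLs are not given there.
# Assumption: "2.0" maps to Spark 2.0.2 and "2.2" maps to Spark 2.2.1.
KNOWN_MASTER_URLS = {
    "2.0": "spark://spark-master-svc:7077",
    "2.2": "spark://spark-master221-svc:7077",
}


def master_url_for(env=None):
    """Pick a master URL based on SPARK_VERSION (hypothetical helper)."""
    env = os.environ if env is None else env
    version = env.get("SPARK_VERSION")
    if version not in KNOWN_MASTER_URLS:
        raise KeyError("No known master URL for SPARK_VERSION=%r" % version)
    return KNOWN_MASTER_URLS[version]


print(master_url_for({"SPARK_VERSION": "2.2"}))
```

In a job script, the returned value would take the place of masterUrl when the SparkContext is created. Failing loudly for an unlisted version is a deliberate choice here, so that a mismatch is caught before Spark starts.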

DSX Local automatically sets environment variables such as DSX_TOKEN, DSX_USER_ID, DSX_PROJECT_NAME, and DSX_PROJECT_DIR. For example, if you use DSX Local to create a batch scoring script for you, you might see:

import os
# DSX_TOKEN and DSX_PROJECT_NAME are set automatically by DSX Local for each job run
header_online = {'Content-Type': 'application/json', 'Authorization': os.environ['DSX_TOKEN']}
scoring_path = 'http://dsx-scripted-ml-python2-svc.dsxl-ml:7300/api/v1/score/unpublished/{0}/{1}'.format(os.environ['DSX_PROJECT_NAME'],'CarsModel')
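When a script depends on these variables, it can be useful to check for them up front, especially if the script is also tested outside a job run where they are not set. This is a sketch under that assumption; the variable names come from the text above (the list there may not be exhaustive), and `read_dsx_environment` is a hypothetical helper.

```python
import os

# Variable names taken from the text; the product may set others as well.
DSX_VARIABLES = ("DSX_TOKEN", "DSX_USER_ID", "DSX_PROJECT_NAME", "DSX_PROJECT_DIR")


def read_dsx_environment(env=None):
    """Return the DSX variables as a dict, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in DSX_VARIABLES if name not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in DSX_VARIABLES}
```

Passing a plain dictionary as `env` makes the helper easy to exercise in a test without touching the real environment.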

When specifying assets such as CSV files in a script's command line arguments, you can use either the full path (/user-home/<uid>/DSX_Projects/<project_name>/datasets/<asset>) or a path relative to the user home (~/DSX_Projects/<project_name>/datasets/<asset>).
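On the script side, the asset path arrives as an ordinary command line argument. The sketch below shows one way a script might accept either path form. The argument name `input_csv` and the function `parse_args` are hypothetical, and calling `os.path.expanduser` is a defensive assumption in case the `~` reaches the script unexpanded.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the job's command line arguments (argument name is illustrative)."""
    parser = argparse.ArgumentParser(description="Example job script")
    parser.add_argument("input_csv", help="Path to a CSV data asset")
    args = parser.parse_args(argv)
    # Expand a leading '~' if present; absolute paths pass through unchanged.
    args.input_csv = os.path.expanduser(args.input_csv)
    return args


# Example with an explicit argument list; a real job would call parse_args()
# with no arguments so that the values come from the Command line arguments field.
example = parse_args(["/user-home/1001/DSX_Projects/demo/datasets/cars.csv"])
print(example.input_csv)
```

The user ID 1001, project name demo, and file name cars.csv in the example are placeholders for illustration only.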

If you need more compute performance for a job, you can edit the CPU and memory resources for a worker from the Workers page in your project. The changes take effect only for subsequent job runs that are assigned to that worker. Workers consume resources (CPU and RAM) only for the duration of each job run, whereas environments typically host services such as notebook servers or IDEs and keep consuming resources until users explicitly stop them.

Caution: Turning Reserve resources off on a worker can affect system performance if available CPU and RAM on the servers become overcommitted.