Table of contents

Add data sources and remote data sets

A data source provides a secure mechanism to store and manage credentials for a database, as opposed to storing them in files managed by the notebook or RStudio tools. It also makes a password easier to change because it can be updated in one secure location, as opposed to every file where that database is referenced.

Remote data sets can be added to each data source. Remote data sets connect directly to a schema and table name, and can be accessed directly from Watson Studio APIs, machine learning models, and flows. When you collaborate on a project, you can opt to create remote data sets for the most relevant and useful tables from the data source. They can also be loaded into machine learning models and flows.

In Watson Studio Local, you can create data sources and remote data sets for the following databases and services:

  • Big SQL
  • Db2 Warehouse on Cloud
  • Db2 for Linux, UNIX, and Windows
  • Db2 for z/OS
  • Hive for HDP
  • HDFS for HDP
  • Hive for Cloudera (CDH)
  • HDFS for Cloudera (CDH)
  • Hyperledger Composer
  • Informix
  • Microsoft SQL Server (MSSQL)
  • Netezza
  • Oracle

To access data from a database that is not on this list, see Import a JDBC driver and Access data in relational databases.

Alternatively, if you opt not to add data sources, you can load data directly from relational databases by following Access data in relational databases.

Create a data source and remote data set

Any collaborator with data source information and credentials can create a data source in the project. To add a data source to a project:

  1. From your project, go to the Data Sources page and click add data source.
  2. In the Add Data Source window, specify all of the required fields.

    Shows the Add Data Source panel

  3. Select the Data Source Type, then use the following format for the URL field:

    Db2 for Linux, UNIX, and Windows
      Format:  jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE};
      Example: jdbc:db2://9.87.654.321:50000/SAMPLE:user=DB2INST1;
    Db2 with SSL
      Format:  jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE}:sslConnection=true;sslTrustStoreLocation={TRUSTSTOREPATH}
      Example: jdbc:db2://9.87.654.321:50001/SAMPLE:sslConnection=true;sslTrustStoreLocation=/user-home/_global_/security/customer-truststores/cacerts;
    Db2 for z/OS
      Format:  jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE};
      Example: jdbc:db2://9.87.654.321:446/STLEC2;
    Db2 Warehouse on Cloud
      Format:  jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE};
      Example: jdbc:db2://9.87.654.321:50000/BLUDB;
    Netezza
      Format:  jdbc:netezza://{HOSTNAME}:{PORT}/{DATABASE}
      Example: jdbc:netezza://9.87.654.321:5480/SYSTEM;
    Informix
      Format:  jdbc:informix-sqli://{HOSTNAME}:{PORT}/{DATABASE}:INFORMIXSERVER={SERVERNAME};DELIMIDENT=y
      Example: jdbc:informix-sqli://9.87.654.321:9088/stores_demo:INFORMIXSERVER=dev;DELIMIDENT=y;
    Oracle
      Format:  jdbc:oracle:thin:{USERNAME}/{PASSWORD}@{HOSTNAME}:{PORT}:{SID}
      Example: jdbc:oracle:thin:system/oracle@9.87.654.321:1521:xe
    Big SQL
      Format:  jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE}
      Example: jdbc:db2://9.87.654.321:32051/bigsql
    Big SQL with SSL
      Format:  jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE}:sslConnection=true;sslTrustStoreLocation={TRUSTSTOREPATH}
      Example: jdbc:db2://9.87.654.321:51000/bigsql:sslConnection=true;sslTrustStoreLocation=/user-home/_global_/security/customer-truststores/cacerts;
    MSSQL
      Format:  jdbc:sqlserver://{HOSTNAME}:{PORT};databaseName={DBNAME}
      Example: jdbc:sqlserver://9.87.654.321:1433;databaseName=master

    For HDFS services, specify the HDFS host, HDFS port, and WebHDFS URL:

    HDFS - HDP
      HDFS host:   9.87.654.323
      HDFS port:   8020
      WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1
    HDFS - CDH
      HDFS host:   9.87.654.323
      HDFS port:   8020
      WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1

    For Hive services, specify the WebHCAT URL (or secure WebHCAT URL), WebHDFS URL, and Livy URL:

    Hive - HDP
      WebHCAT URL: https://9.87.654.323:8443/gateway/dsx/templeton
      WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1
      Livy URL:    https://9.87.654.323:8443/gateway/dsx/livy/v1
    Hive - CDH
      WebHCAT URL: https://9.87.654.323:8443/gateway/dsx/templeton
      WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1
      Livy URL:    https://9.87.654.323:8443/gateway/dsx/livy/v1

    For Hyperledger Composer, specify the URL endpoint. The API key is required only if an API key is set on the Hyperledger Composer server.
  4. If you select the Shared check box, the data source and its credentials become accessible to all collaborators in the project. If you do not select Shared, then each collaborator must provide their own credentials to use the data source.
    Important: When a data source is shared, it cannot be made private again.
    Restriction: A remote data source cannot be shared for an external Git project.
  5. Click the Test Connection button to verify that the connection to the data source works.
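The URL formats in step 3 are plain templates. As an illustration only (this helper is hypothetical and not part of Watson Studio Local, and the host, port, and database values are examples), you can fill in the Db2 template programmatically before pasting the result into the URL field:

```python
# Hypothetical helper: fill the documented Db2 JDBC URL template.
# The hostname, port, and database values are illustrative only.
def build_db2_url(hostname, port, database, truststore_path=None):
    url = "jdbc:db2://{0}:{1}/{2}".format(hostname, port, database)
    if truststore_path:
        # SSL variant, following the "Db2 with SSL" format above
        url += ":sslConnection=true;sslTrustStoreLocation={0}".format(truststore_path)
    return url + ";"

print(build_db2_url("9.87.654.321", 50000, "SAMPLE"))
# jdbc:db2://9.87.654.321:50000/SAMPLE;
```

The same pattern applies to the other vendors; only the URL prefix and property syntax differ.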

You can add or edit remote data sets that connect directly to a schema and table name in the data source. Notebooks within the project can then refer to each remote data set by a name to retrieve its information. You can either define remote data sets when you create the data source, or add them later. You can also browse a schema for most table types.

To preview a data set, click Preview next to it.

All collaborators in a project can use a shared data source in their notebook, or add their own private data source. Collaborators with viewer privileges cannot edit data sources.

Big SQL requirement: In a Zeppelin notebook, you must set up your Big SQL interpreter to work within Zeppelin.

Prerequisite for SSL for Db2 and Big SQL LUW

If you plan to use SSL for a Db2 for Linux, UNIX, and Windows or a Big SQL connection that uses a self-signed certificate or a certificate that is signed by a local certificate authority (CA), you need to import the SSL certificate to the Spark truststore:

  1. Export the database server self-signed certificate or local CA certificate to a file or contact your database administrator for this certificate file.
  2. Import the certificate to the Spark truststore and mark it as trusted. You can do this step with a script, /wdp/utils/, which is provided on the master nodes of the cluster. The following command runs the script:
    /wdp/utils/ -truststore -import -alias <alias_name> -file <certificate_path> -storepass <truststore_password>
    where <alias_name> is the name that you choose for your certificate entry in the truststore, <certificate_path> is the fully qualified path to the certificate file that you exported in the previous step, and <truststore_password> is the truststore password, which by default is changeit.
  3. Restart the kernels for any notebooks that are open to ensure that the latest version of the truststore is accessed.
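Internally, importing a certificate into a Java truststore is a keytool operation. The following sketch is an assumption about what the provided script does, not its actual contents; the paths and alias are illustrative. It only builds the equivalent keytool command:

```python
# Sketch: build the keytool command line that imports a certificate into
# a truststore and marks it trusted. All argument values are illustrative.
def keytool_import_cmd(truststore, alias, cert_file, storepass="changeit"):
    return ["keytool", "-importcert", "-noprompt",
            "-keystore", truststore,
            "-alias", alias,
            "-file", cert_file,
            "-storepass", storepass]

cmd = keytool_import_cmd("/tmp/truststore.jks", "mydb", "/tmp/db2.crt")
# subprocess.run(cmd, check=True)  # would invoke keytool if it is installed
```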

To change the password for the truststore, you can use the -storepasswd flag:

/wdp/utils/ -truststore -storepasswd
-storepass <truststore_password> -new <new_passwd>

where <truststore_password> is the current password and <new_passwd> is the new password.

Append the SSL connection properties to the JDBC URL in the data source definition by editing the data source, for example:

:sslConnection=true;sslTrustStoreLocation=/user-home/_global_/security/customer-truststores/cacerts;
Load data from a remote data set

To automatically load data into a data frame in a notebook, open the notebook and click the Find and Add Data icon ( Shows the Find and Add Data icon) in the toolbar. Then, click the Remote tab. Find the data set that you want, and click Insert to code.

Shows the Find and Add Data panel

The code to import the data into a data frame is added to your notebook. For example:

import dsx_core_utils
dataSet = dsx_core_utils.get_remote_data_set_info('#{fileName}')
dataSource = dsx_core_utils.get_data_source_info(dataSet['datasource'])

Access data in a data source and remote data sets with APIs

To load data from a remote data set into a data frame in a notebook, you can use Watson Studio Local data source APIs. Limitation: Scala is not supported.

Use the get_data_source_info and get_remote_data_set_info APIs to retrieve information about a data source and remote data set, respectively:

Python syntax
import dsx_core_utils
dsx_core_utils.get_data_source_info(<DATASOURCE NAME>)
dsx_core_utils.get_remote_data_set_info(<REMOTEDATASET NAME>)

where <DATASOURCE NAME> is the data source name as a string, and <REMOTEDATASET NAME> is the remote data set name as a string. The remote data set contains a table name and schema name.

R syntax
getDataSourceInfo(<DATASOURCE NAME>)
getRemoteDataSetInfo(<REMOTEDATASET NAME>)

where <DATASOURCE NAME> is the data source name as a string, and <REMOTEDATASET NAME> is the remote data set name as a string. The remote data set contains a table name and schema name.

The get_data_source_info API returns a dictionary, <DATASOURCE DICTIONARY>, that contains the following unordered variables:

Variable name       Description
description         Description of the data source
driver_class        Name of the database driver class, for example, com.informix.jdbc.IfxDriver
dsx_artifact_type   The value is "datasource"
name                Name of the data source
password            Password for the user
shared              True or False; indicates whether the credentials are shared by all collaborators in the project
type                Name of the database vendor
user                User name

The get_remote_data_set_info API returns a dictionary, <DATASET DICTIONARY>, that contains the following unordered variables:

Variable name       Description
datasource          Name of the associated data source
description         Description of the remote data set
dsx_artifact_type   The value is "remotedataset"
name                Label of the remote data set
noun                The value is "table"
schema              Schema of the table
table               Table name
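In practice, the schema and table variables from <DATASET DICTIONARY> are combined into a qualified table name for SQL queries. The sketch below uses a made-up dictionary (not real API output) and guards against an empty schema, mirroring the check in the R on Spark example later in this topic:

```python
# Hypothetical remote data set dictionary, shaped like the table above.
data_set = {"datasource": "Oracledata", "schema": "SCOTT", "table": "EMP"}

def qualified_name(ds):
    # Fall back to the bare table name when no schema is set.
    schema = (ds.get("schema") or "").strip()
    return "{0}.{1}".format(schema, ds["table"]) if schema else ds["table"]

print(qualified_name(data_set))  # SCOTT.EMP
```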
Python example
import dsx_core_utils
ds = dsx_core_utils.get_data_source_info("informix")
dataset = dsx_core_utils.get_remote_data_set_info("EmployeeTable")

Example Python data source output:

{u'description': u'informix',
 'driver_class': 'com.informix.jdbc.IfxDriver',
 u'dsx_artifact_type': u'datasource',
 u'name': u'informix',
 u'password': u'oracle',
 u'shared': True,
 u'type': u'informix',
 u'user': u'system'}

Example Python remote data set output:

{u'datasource': u'Oracledata',
 u'description': u'Employee Data',
 u'dsx_artifact_type': u'remotedataset',
 u'name': u'EmployeeTable',
 u'noun': u'table',
 u'schema': u'SCOTT',
 u'table': u'EMP'}

R example
ds = getDataSourceInfo("dbdb")
dataset = getRemoteDataSetInfo("db3")

You can access individual variables by using the $ operator, for example, ds$description.

Example R data source output:

  'Db2 description'

Example R remote data set output:


Sample notebook code for data source usage

Python example

To import modules as needed to connect to the data sources:

# Imports
import dsx_core_utils
import jaydebeapi

To set up modules for dataframes:

# Imports
import pandas as pd
from pyspark.sql import SQLContext, SparkSession
sparkSession = SparkSession.builder.getOrCreate()

To read data source and remote data set information:

db2_dataSet = dsx_core_utils.get_remote_data_set_info('db2Set')
db2_dataSource = dsx_core_utils.get_data_source_info(db2_dataSet['datasource'])
db2_dbTableOrQuery = db2_dataSet["schema"] + "." + db2_dataSet["table"]

To connect to the data source and load the pandas data frame:

# jaydebeapi & pandas
db2_driverClass = db2_dataSource["driver_class"]
conn = jaydebeapi.connect(db2_driverClass, [db2_dataSource["URL"],
db2_dataSource["user"], db2_dataSource["password"]])
df = pd.read_sql('select * from ' + db2_dbTableOrQuery, con=conn)
print('loaded dataframe from DB2 LUW (no ssl). # Records: ', len(df))

R examples

The following R on Spark example is recommended for R/RStudio:

df1 <- NULL

# The following is an example for creating SparkSession, the entry point into SparkR.
# You can define your own spark session by modifying this example.
# When using spark 2.2.1, change the master url into "spark://spark-master221-svc:7077"

sc <- sparkR.session(master="spark://spark-master-svc:7077")

# Add data asset from remote connection
rmdsi = getRemoteDataSetInfo("oracle-rds")
dsi= getDataSourceInfo(rmdsi$datasource)
if (nchar(trimws(rmdsi$schema)) != 0) {
 dbTableOrQuery = paste(rmdsi$schema,rmdsi$table,sep=".")
} else {
 dbTableOrQuery = rmdsi$table
}
df1 <- read.jdbc(dsi$URL, dbTableOrQuery, user=dsi$user, password=dsi$password)
# Db2 and Db2 on Cloud example
dbSchemaTable = paste(schema,table,sep=".")
returnQuery = paste("Select * from", dbSchemaTable)

# The driver class for db2jcc4.jar is com.ibm.db2.jcc.DB2Driver
drv <- JDBC("com.ibm.db2.jcc.DB2Driver", "/dbdrivers/db2jcc4.jar")
conn <- dbConnect(drv, URL, user, password)
data <- dbSendQuery(conn, returnQuery)
# fetch first 5 rows
df <- dbFetch(data, n = 5)

Configure the ability to create shared data sources

Data scientists can create data sources with shared credentials. Those shared credentials can be used by all collaborators in a project to connect to the data source. For enterprises that do not allow credentials to be shared, the Watson Studio Local administrator can set a configuration property that will not allow data scientists to create data sources with shared credentials.
Important: You must have patch01 installed to take advantage of this enhancement. From the patch download page, select wsl-x86-v1231-patch01-TS002078840-TS002078850 to get the enhancement.

To configure whether data scientists can create data sources with shared credentials:

  1. Run kubectl get pods -n dsx and identify the utils-api pod.
  2. Run kubectl exec -it <utils-api pod name> -n dsx -- /bin/bash to open a shell in the pod.
  3. Create or edit the /user-home/_global_/config/ file and set DataSourceSharedCredentials=false. The default value for DataSourceSharedCredentials is true.
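The configuration file uses simple key=value lines. As an illustration only (the helper and the in-memory lines are hypothetical; edit the actual file under /user-home/_global_/config/ on your cluster), setting or updating the property might look like this:

```python
# Sketch: set DataSourceSharedCredentials in a list of key=value lines,
# updating the key in place if present, appending it otherwise.
def set_property(lines, key, value):
    out, found = [], False
    for line in lines:
        if line.split("=", 1)[0].strip() == key:
            out.append("{0}={1}".format(key, value))
            found = True
        else:
            out.append(line)
    if not found:
        out.append("{0}={1}".format(key, value))
    return out

print(set_property(["DataSourceSharedCredentials=true"],
                   "DataSourceSharedCredentials", "false"))
# ['DataSourceSharedCredentials=false']
```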