Add data sources and remote data sets

A data source provides a secure mechanism to store and manage credentials for a database, as opposed to storing them in files managed by the notebook or RStudio tools. It also makes a password easier to change because it can be updated in one secure location, as opposed to every file where that database is referenced.

Remote data sets can be added to each data source. A remote data set connects directly to a schema and table name in the data source, and can be accessed from DSX Local APIs and loaded into machine learning models and flows. When you collaborate on a project, you can opt to create remote data sets for the most relevant and useful tables from the data source.

In DSX Local, you can create data sources and remote data sets for the following on-premises databases and services:

  • Big SQL
  • Db2 Warehouse on Cloud
  • Db2 for Linux, UNIX, and Windows
  • Db2 for z/OS
  • Hive for HDP
  • HDFS for HDP
  • Hive for Cloudera (CDH)
  • HDFS for Cloudera (CDH)
  • Informix
  • Netezza
  • Oracle

To access data from a database that is not on this list, see Import a JDBC driver and Access data in relational databases.


Alternatively, if you opt not to add data sources, you can load data directly from relational databases; see Access data in relational databases.

Create a data source and remote data set

Any collaborator with data source information and credentials can create a data source in the project. To add a data source to a project:

  1. From your project, go to the Data Sources page and click add data source.
  2. In the Add Data Source window, specify all of the required fields.

    Shows the Add Data Source panel

  3. Select the Data Source Type, and then use the following format for the URL field (a sketch that shows how to fill in the placeholders follows this procedure):

JDBC URL formats, by on-premises database or service:

  • Db2 for Linux, UNIX, and Windows
    Format: jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE};
    Example: jdbc:db2://9.87.654.321:50000/SAMPLE:user=DB2INST1;
  • Db2 with SSL
    Format: jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE}:sslConnection=true;sslTrustStoreLocation={TRUSTSTOREPATH}
    Example: jdbc:db2://9.87.654.321:50001/SAMPLE:sslConnection=true;sslTrustStoreLocation=/user-home/_global_/security/customer-truststores/cacerts;
  • Db2 for z/OS
    Format: jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE};
    Example: jdbc:db2://dtec649.vmec.ibm.com:446/STLEC2;
  • Db2 Warehouse on Cloud
    Format: jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE};
    Example: jdbc:db2://dashdb-entry-yp-dal09-07.services.dal.bluemix.net:50000/BLUDB;
  • Netezza
    Format: jdbc:netezza://{HOSTNAME}:{PORT}/{DATABASE}
    Example: jdbc:netezza://9.87.654.321:5480/SYSTEM;
  • Informix
    Format: jdbc:informix-sqli://{HOSTNAME}:{PORT}/{DATABASE}:INFORMIXSERVER={SERVERNAME};DELIMIDENT=y
    Example: jdbc:informix-sqli://9.87.654.321:9088/stores_demo:INFORMIXSERVER=dev;DELIMIDENT=y;
  • Oracle
    Format: jdbc:oracle:thin:{USERNAME}/{PASSWORD}@{HOSTNAME}:{PORT}:{SID}
    Example: jdbc:oracle:thin:system/oracle@9.87.654.321:1521:xe
  • Big SQL
    Format: jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE}
    Example: jdbc:db2://9.87.654.321:32051/bigsql
  • Big SQL with SSL
    Format: jdbc:db2://{HOSTNAME}:{PORT}/{DATABASE}:sslConnection=true;sslTrustStoreLocation={TRUSTSTOREPATH}
    Example: jdbc:db2://other2.fyre.ibm.com:51000/bigsql:sslConnection=true;sslTrustStoreLocation=/user-home/_global_/security/customer-truststores/cacerts;
  • MS SQL (custom)
    Format: jdbc:sqlserver://{HOSTNAME}:{PORT};databaseName={DBNAME}
    Example: jdbc:sqlserver://9.30.54.105:1433;databaseName=master
HDFS connection details:

  • HDFS for HDP
    HDFS host: 9.87.654.323
    HDFS port: 8020
    WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1
  • HDFS for Cloudera (CDH)
    HDFS host: 9.87.654.323
    HDFS port: 8020
    WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1

Hive connection details:

  • Hive for HDP
    WebHCAT URL: http://sloe3.ibm.com:50111/templeton/v1/
    Secure WebHCAT URL: https://9.87.654.323:8443/gateway/dsx/templeton
    WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1
    Livy URL: https://9.87.654.323:8443/gateway/dsx/livy/v1
  • Hive for Cloudera (CDH)
    WebHCAT URL: http://sloe3.ibm.com:50111/templeton/v1/
    Secure WebHCAT URL: https://9.87.654.323:8443/gateway/dsx/templeton
    WebHDFS URL: https://9.87.654.323:8443/gateway/dsx/webhdfs/v1
    Livy URL: https://9.87.654.323:8443/gateway/dsx/livy/v1
  4. If you select the Shared check box, the data source and its credentials become accessible to all collaborators in the project. If you do not select Shared, each collaborator must provide their own credentials to use the data source.

    Important: After a data source is shared, it cannot be made private again.
    Restriction: A remote data source cannot be shared for an external Git project.
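
As a sketch of how the URL placeholders fit together, the following Python snippet assembles a Db2 JDBC URL from the format in the table. The host, port, and database values are illustrative only:

# Minimal sketch: assemble a Db2 JDBC URL from its placeholders.
# The host, port, and database values are illustrative, not real servers.
hostname = "9.87.654.321"
port = 50000
database = "SAMPLE"
url = "jdbc:db2://{0}:{1}/{2};".format(hostname, port, database)
print(url)  # jdbc:db2://9.87.654.321:50000/SAMPLE;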

You can add or edit remote data sets that connect directly to a schema and table name in the data source. Notebooks within the project can then refer to each remote data set by name to retrieve its data. You can either define remote data sets when you create the data source, or add them later.

All collaborators in a project can use a shared data source in their notebook, or add their own private data source. Collaborators with viewer privileges cannot edit data sources.

Big SQL requirement: In a Zeppelin notebook, you must set up your Big SQL interpreter to work within Zeppelin.

Prerequisite for SSL for Db2 LUW and Big SQL

If you plan to use SSL for a Db2 for Linux, UNIX, and Windows or a Big SQL connection that uses a self-signed certificate or a certificate that is signed by a local certificate authority (CA), you need to import the SSL certificate to the Spark truststore:

  1. Export the database server self-signed certificate or local CA certificate to a file, or contact your database administrator for this certificate file. (A sketch of one way to export the certificate follows these steps.)
  2. Import the certificate to the Spark truststore and mark it as trusted. You can do this step with a script, /wdp/utils/idpKeyMan.sh, which is provided on the master nodes of the cluster. The following command runs the script:
    /wdp/utils/idpKeyMan.sh -truststore -import -alias <alias_name> -file <certificate_path> -storepass <truststore_password>
    
    where <alias_name> is the name that you choose for your certificate entry in the truststore, <certificate_path> is the fully qualified path to the file with the certificate that you exported in the previous step, and <truststore_password> is changeit by default.
  3. Restart the kernels for any notebooks that are open to ensure that the latest version of the truststore is accessed.
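
For step 1, if you cannot obtain the certificate file from your database administrator, the following Python sketch retrieves the server certificate over the network. The host name and port are placeholders that you must replace with your database server's values:

# Minimal sketch: fetch the database server's certificate in PEM format.
# "db-server.example.com" and 50001 are placeholders for your SSL-enabled
# database host and port; the call captures the certificate that the
# server presents during the TLS handshake.
import ssl

pem_cert = ssl.get_server_certificate(("db-server.example.com", 50001))
with open("db-server-cert.pem", "w") as f:
    f.write(pem_cert)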

To change the password for the truststore, you can use the -storepasswd flag:

/wdp/utils/idpKeyMan.sh -truststore -storepasswd -storepass <truststore_password> -new <new_passwd>

where <truststore_password> is the current password and <new_passwd> is the new password.

Finally, edit the data source definition and append the following to its JDBC URL:

:sslConnection=true;sslTrustStoreLocation=/user-home/_global_/security/customer-truststores/cacerts;

Load data from a remote data set

To automatically load data into a data frame in a notebook, open the notebook and click the Find and Add Data icon in the toolbar. Then, click the Remote tab, find the data set that you want, and click Insert to code.

Shows the Find and Add Data panel

The code to import the data into a data frame is added to your notebook. For example:

import dsx_core_utils
dataSet = dsx_core_utils.get_remote_data_set_info('#{fileName}')
dataSource = dsx_core_utils.get_data_source_info(dataSet['datasource'])
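
The inserted cell stops at retrieving the remote data set and data source details. Loading the table into a data frame then follows the same pattern as the sample notebook code later in this topic; the following is a minimal sketch, assuming the jaydebeapi package and the dataSet and dataSource variables from the inserted cell:

# Minimal sketch: read the remote table into a pandas data frame,
# mirroring the jaydebeapi call style in the sample code below.
import jaydebeapi
import pandas as pd

conn = jaydebeapi.connect(dataSource["driver_class"],
                          [dataSource["URL"], dataSource["user"], dataSource["password"]])
df = pd.read_sql("select * from " + dataSet["schema"] + "." + dataSet["table"], con=conn)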

Access data in a data source and remote data sets with APIs

To load data from a remote data set into a data frame in a notebook, you can use DSX Local data source APIs. Limitation: Scala is not supported.

Use the get_data_source_info and get_remote_data_set_info APIs to retrieve information about a data source and remote data set, respectively:

Python syntax

import dsx_core_utils
<DATASOURCE DICTIONARY> = dsx_core_utils.get_data_source_info(<DATASOURCE NAME>)
<DATASET DICTIONARY> = dsx_core_utils.get_remote_data_set_info(<REMOTEDATASET NAME>)

where <DATASOURCE NAME> is the name of the data source (a string) and <REMOTEDATASET NAME> is the name of the remote data set (a string). The remote data set contains a table name and schema name.

R syntax

library(dsxCoreUtilsR)
<DATASOURCE DICTIONARY> = getDataSourceInfo(<DATASOURCE NAME>)
<DATASET DICTIONARY> = getRemoteDataSetInfo(<REMOTEDATASET NAME>)

where <DATASOURCE NAME> is the name of the data source (a string) and <REMOTEDATASET NAME> is the name of the remote data set (a string). The remote data set contains a table name and schema name.

The get_data_source_info API returns a dictionary (<DATASOURCE DICTIONARY>) that contains the following unordered variables:

  • URL: The JDBC URL
  • description: Description of the data source
  • driver_class: Name of the database driver class, for example, com.informix.jdbc.IfxDriver
  • dsx_artifact_type: The value is "datasource"
  • name: Name of the data source
  • password: Password for the user
  • shared: True or False; indicates whether the credentials are used by all collaborators in the project
  • type: Name of the database vendor
  • user: User name

The get_remote_data_set_info API returns a dictionary (<DATASET DICTIONARY>) that contains the following unordered variables:

  • datasource: Name of the associated data source
  • description: Description of the remote data set
  • dsx_artifact_type: The value is "remotedataset"
  • name: Label of the remote data set
  • noun: The value is "table"
  • schema: Schema of the table
  • table: Table name

Python example

import dsx_core_utils
ds = dsx_core_utils.get_data_source_info("informix")
dataset = dsx_core_utils.get_remote_data_set_info("informix_cars")

Example Python data source output:

{u'URL': u'jdbc:informix-sqli://127.0.0.1:9088/stores_demo:INFORMIXSERVER=dev;DELIMIDENT=y;user=informix;password=inmixpass',
 u'description': u'informix',
 u'driver_class': u'com.informix.jdbc.IfxDriver',
 u'dsx_artifact_type': u'datasource',
 u'name': u'informix',
 u'password': u'inmixpass',
 u'shared': True,
 u'type': u'informix',
 u'user': u'informix'}

Example Python remote data set output:

{u'datasource': u'informix',
 u'description': u'Car data',
 u'dsx_artifact_type': u'remotedataset',
 u'name': u'informix_cars',
 u'noun': u'table',
 u'schema': u'informix',
 u'table': u'cars'}

R example

library(dsxCoreUtilsR)
ds = getDataSourceInfo("dbdb")
dataset = getRemoteDataSetInfo("db3")

You can access individual variables by using the $ operator, for example, ds$URL.

Example R data source output:

$dsx_artifact_type
'datasource'
$type
'Db2'
$description
'Db2 description'
$URL
'jdbc:db2://sysmvs1.stl.ibm.com:5021'
$name
'dbdb'
$driverClassName
'com.ibm.db2.jcc.DB2Driver'

Example R remote data set output:

$dsx_artifact_type
'remotedataset'
$name
'db3'
$description
'string'
$datasource
'dbdb'
$noun
'table'
$schema
'db2inst1'
$table
'car'

Sample notebook code for data source usage

Python example

To import the modules that are needed to connect to the data sources:

# Imports
import dsx_core_utils
import jaydebeapi

To set up the modules for data frames and create a Spark session:

# Imports
import pandas as pd
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.master("local").appName("test").getOrCreate()

To read data source and remote data set information:

db2_dataSource = dsx_core_utils.get_data_source_info('db2Source')
db2_dataSet = dsx_core_utils.get_remote_data_set_info('db2Set')
db2_dbTableOrQuery = db2_dataSet["schema"] + "." + db2_dataSet["table"]

To connect to the data source and load a pandas data frame:

# jaydebeapi & pandas
print(db2_dataSource["URL"])
conn = jaydebeapi.connect(db2_dataSource["driver_class"],
                          [db2_dataSource["URL"], db2_dataSource["user"], db2_dataSource["password"]])
df = pd.read_sql('select * from ' + db2_dbTableOrQuery, con=conn)
print('loaded dataframe from Db2 LUW (no SSL). # Records: ' + str(len(df)))
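
As an alternative to pandas, the Spark session that was created earlier can read the same table into a Spark data frame. The following is a minimal sketch, assuming that the JDBC driver jar is available on the Spark classpath:

# Minimal sketch: Spark JDBC read with the same data source details.
# Assumes the JDBC driver jar is on the Spark classpath.
spark_df = sparkSession.read \
    .format("jdbc") \
    .option("url", db2_dataSource["URL"]) \
    .option("driver", db2_dataSource["driver_class"]) \
    .option("dbtable", db2_dbTableOrQuery) \
    .option("user", db2_dataSource["user"]) \
    .option("password", db2_dataSource["password"]) \
    .load()
spark_df.show(5)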

R examples

To load a Spark data frame with SparkR:

# Imports
library(dsxCoreUtilsR)
library(SparkR)

dataSet <- getRemoteDataSetInfo("EmployeeTable")
dataSource <- getDataSourceInfo(dataSet$datasource)

dbTableOrQuery <- paste(dataSet$schema, dataSet$table, sep=".")
df <- read.jdbc(dataSource$URL, dbTableOrQuery, user=dataSource$user, password=dataSource$password)

take(df, count(df))

To load data with RJDBC:

# Imports
library(dsxCoreUtilsR)
library(RJDBC)

# RJDBC: Db2 LUW (no SSL)
dataSource <- getDataSourceInfo('db2Source')
dataSet <- getRemoteDataSetInfo('db2Set')
drv <- JDBC(dataSource$driverClassName)
conn <- dbConnect(drv, dataSource$URL, dataSource$user, dataSource$password)
d <- dbReadTable(conn, paste(dataSet$schema, '.', dataSet$table, sep=""))
d