Protect Hive data and access it securely from Watson Studio Local

You can protect Hive data that resides in a secure CDH cluster and access it securely from Watson Studio Local.

Before you start

  1. Set up a CDH cluster.
  2. Enable Kerberos for the CDH cluster.
  3. Add a Sentry service to the CDH cluster and associate Hive and HDFS with the Sentry service.
  4. Set up a Watson Studio Local Version 1.2.0 or later cluster.
  5. Configure the Watson Studio Local Hadoop integration service.
  6. Ensure that every Watson Studio Local user has an associated Linux user on all nodes of the CDH cluster, an HDFS home directory under /user, and a Kerberos principal (see the sketch after this list).
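
As a rough sketch of that last prerequisite, a Hadoop admin might provision a user bob as follows. The user name bob, the group bob, and the Kerberos realm EXAMPLE.COM are placeholder assumptions; substitute the values from your environment.

# On every CDH node: create the Linux user
useradd bob

# As the HDFS superuser: create bob's home directory and hand over ownership
hdfs dfs -mkdir /user/bob
hdfs dfs -chown bob:bob /user/bob

# On the KDC host: create a Kerberos principal for bob (prompts for a password)
kadmin.local -q "addprinc bob@EXAMPLE.COM"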

Protect Hive data residing in a secure CDH cluster

Before you proceed, the Hadoop admin must ensure that the HDFS permissions of the Hive warehouse are locked down.

hdfs dfs -chown -R hive:hive /user/hive/warehouse
hdfs dfs -chmod -R 0771 /user/hive/warehouse
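
A quick check that the lockdown took effect, offered here as an optional verification step rather than part of the official procedure:

# Verify owner, group, and mode on the warehouse directory itself
hdfs dfs -ls -d /user/hive/warehouse
# With mode 0771 the listing should show drwxrwx--x and owner hive:hive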

Use HDFS ACLs to restrict select or insert access on specific Hive tables or views to selected users or groups.

The example below restricts select access on the dsx_emp table to user bob.

hdfs dfs -setfacl -m user:bob:r-x /user/hive/warehouse/dsxdevdb.db
hdfs dfs -setfacl -R -m user:bob:r-x /user/hive/warehouse/dsxdevdb.db/dsx_emp
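
To confirm the ACLs were applied, the admin can inspect the table directory; the expected entry noted in the comment is indicative, not verbatim output:

# List the ACLs on the table directory
hdfs dfs -getfacl /user/hive/warehouse/dsxdevdb.db/dsx_emp
# The listing should include the entry user:bob:r-x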

Securely access Hive data from Watson Studio Local using a remote Spark Livy session

Using the following example Jupyter notebook as guidance, a Watson Studio Local user can list the available Livy endpoints, create a Livy session, and run Hive queries in that session.

The notebook runs authenticated as the Watson Studio Local user bob. User bob has been granted HDFS ACLs on the underlying data files of the dsx_emp table, but not on any other table, which is why he can query dsx_emp but not dsx_dept.

When Spark accesses a Hive table or view over a remote Livy session, it reads the Hive metastore directly, and the metastore has no Sentry binding. Spark therefore needs privileges on the underlying data files that make up the table or view.

import dsx_core_utils
dsx_core_utils.setup_livy_sparkmagic()
dsx_core_utils.list_dsxhi_livy_endpoints()

success configuring sparkmagic livy.
['https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1']

%spark add -s bob-session -l python -u https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1

Added endpoint https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1
Starting Spark application

ID  YARN Application ID             Kind     State  Spark UI  Driver log  Current session?
28  application_1525906421853_0031  pyspark  idle   Link      Link        ✔

SparkContext available as 'sc'.
HiveContext available as 'sqlContext'.
%%spark -c sql
select * from dsxdevdb.dsx_emp

   serialno
0         1
1         2
2         3
%%spark -c sql
select * from dsxdevdb.dsx_dept

An error was encountered:
An error occurred while calling o85.partitions.
: org.apache.hadoop.security.AccessControlException: Permission denied: user=bob, access=READ_EXECUTE, inode="/user/hive/warehouse/dsxdevdb.db/dsx_dept":hive:hive:drwxrwx--x
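
If bob should also be able to query dsx_dept, the admin can apply the same ACL pattern to that table's directory. This is a sketch under the warehouse layout assumed above; the dsx_dept path is inferred from the error message:

# Grant bob read and traverse access on the files of the dsx_dept table
hdfs dfs -setfacl -R -m user:bob:r-x /user/hive/warehouse/dsxdevdb.db/dsx_dept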