Protect Hive data and access it securely from Watson Studio Local
You can protect Hive data that resides in a secure CDH cluster and access it securely from Watson Studio Local.
Before you start
- Set up a CDH cluster.
- Enable Kerberos for the CDH cluster.
- Add a Sentry service to the CDH cluster and associate Hive and HDFS with the Sentry service.
- Set up a Watson Studio Local Version 1.2.0 or later cluster.
- Configure the Watson Studio Local Hadoop integration service.
- Ensure that every Watson Studio Local user has an associated Linux user on all nodes of the CDH cluster, an HDFS home directory for that user under /user, and a Kerberos principal.
Protect Hive data residing in a secure CDH cluster
Before you proceed, the Hadoop admin must ensure that the HDFS permissions of the Hive warehouse are locked down:
hdfs dfs -chown -R hive:hive /user/hive/warehouse
hdfs dfs -chmod -R 0771 /user/hive/warehouse
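The effect of mode 0771 is what makes the ACLs in the next step necessary. A small illustrative sketch (not part of the product) decodes the mode: owner and group (both hive) get full access, while everyone else gets execute-only, meaning they can traverse the warehouse directory but cannot list or read anything in it without an explicit HDFS ACL.

```python
def mode_to_rwx(mode: int) -> str:
    """Render a numeric permission mode (e.g. 0o771) as an rwx string."""
    bits = "rwxrwxrwx"
    return "".join(b if (mode >> (8 - i)) & 1 else "-" for i, b in enumerate(bits))

# 0771: owner rwx, group rwx, others execute-only (traverse but not list/read).
print(mode_to_rwx(0o771))  # prints rwxrwx--x
```

This matches the drwxrwx--x mode that appears in the AccessControlException later in this article.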
Use HDFS ACLs to restrict select or insert access on specific Hive tables or views to selected users or groups.
The following example restricts select access on the dsx_emp table in the dsxdevdb database to user bob:
hdfs dfs -setfacl -m user:bob:r-x /user/hive/warehouse/dsxdevdb.db
hdfs dfs -setfacl -R -m user:bob:r-x /user/hive/warehouse/dsxdevdb.db/dsx_emp
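If many users or tables need the same treatment, it can help to generate the setfacl commands programmatically. The following is a hypothetical convenience sketch: the paths and ACL spec mirror the example above, but the helper itself is not a Watson Studio Local or Hadoop API.

```python
def hive_read_acl_commands(user: str, db_dir: str, table: str) -> list:
    """Return the hdfs setfacl commands granting read-only access to one table."""
    spec = "user:{}:r-x".format(user)
    return [
        # Non-recursive ACL on the database directory: lets the user enter it.
        "hdfs dfs -setfacl -m {} {}".format(spec, db_dir),
        # Recursive ACL on the table directory: lets the user read its data files.
        "hdfs dfs -setfacl -R -m {} {}/{}".format(spec, db_dir, table),
    ]

for cmd in hive_read_acl_commands("bob", "/user/hive/warehouse/dsxdevdb.db", "dsx_emp"):
    print(cmd)
```

Note that r-x on the database directory alone does not expose other tables: listing the directory reveals table names, but reading a table's data still requires the recursive ACL on that table's own directory.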
Securely access Hive data from Watson Studio Local using a remote Spark Livy session
Using the following example Jupyter notebook as guidance, a Watson Studio Local user can list the available Livy endpoints, create a Livy session, and run Hive queries in that session.
The notebook runs authenticated as the Watson Studio Local user bob. Because bob has been granted HDFS ACLs only on the underlying data files that make up the dsx_emp table, he can run queries against the dsx_emp table but not against any other table, such as dsx_dept.
When Spark accesses a Hive table or view over a remote Livy session, it must have privileges on the underlying data files that make up that table or view, because it talks directly to the Hive metastore, which does not have a Sentry binding.
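Behind the sparkmagic commands shown below, session creation is a POST to the Livy REST API's /sessions resource on the gateway endpoint. A minimal sketch of that request, assuming the example gateway URL from this article (build_session_request is an illustrative helper, not part of dsx_core_utils or sparkmagic):

```python
import json

# Example Livy endpoint exposed by the Hadoop integration gateway (from this article).
LIVY_URL = "https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1"

def build_session_request(kind: str = "pyspark") -> tuple:
    """Return the (url, json_body) pair for a Livy session-creation POST."""
    body = json.dumps({"kind": kind})
    return (LIVY_URL + "/sessions", body)

url, body = build_session_request()
print(url)
print(body)
```

The gateway authenticates the caller as the Watson Studio Local user, so the resulting YARN application, and every HDFS access it makes, runs under that user's identity.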
import dsx_core_utils
dsx_core_utils.setup_livy_sparkmagic()
dsx_core_utils.list_dsxhi_livy_endpoints()
success configuring sparkmagic livy.
['https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1']
%spark add -s bob-session -l python -u https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1
Added endpoint https://cdh-outsider1.ibm.com:8443/gateway/9.87.654.321/livy/v1
Starting Spark application
ID  YARN Application ID             Kind     State  Spark UI  Driver log  Current session?
28  application_1525906421853_0031  pyspark  idle   Link      Link        ✔
SparkContext available as 'sc'. HiveContext available as 'sqlContext'.
%%spark -c sql
select * from dsxdevdb.dsx_emp
%%spark -c sql
select * from dsxdevdb.dsx_dept
An error was encountered: An error occurred while calling o85.partitions. : org.apache.hadoop.security.AccessControlException: Permission denied: user=bob, access=READ_EXECUTE, inode="/user/hive/warehouse/dsxdevdb.db/dsx_dept":hive:hive:drwxrwx--x
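The AccessControlException text itself tells you which ACL is missing: the user, the requested access, the inode path, and its current owner, group, and mode. A small illustrative sketch that extracts those fields (the regex and helper are assumptions for this example, not a Hadoop API):

```python
import re

# The error message from the dsx_dept query above.
ERR = ('Permission denied: user=bob, access=READ_EXECUTE, '
       'inode="/user/hive/warehouse/dsxdevdb.db/dsx_dept":hive:hive:drwxrwx--x')

def parse_access_denied(msg: str) -> dict:
    """Pull user, access type, inode, owner, group, and mode out of the message."""
    m = re.search(r'user=(\S+), access=(\S+), inode="([^"]+)":([^:]+):([^:]+):(\S+)', msg)
    if not m:
        return {}
    user, access, inode, owner, group, mode = m.groups()
    return {"user": user, "access": access, "inode": inode,
            "owner": owner, "group": group, "mode": mode}

print(parse_access_denied(ERR))
```

Here the fix would be granting bob an r-x ACL on the dsx_dept directory, exactly as was done for dsx_emp; without it, the 0771 warehouse mode denies him READ_EXECUTE.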