Troubleshooting common Watson Studio Local problems

This section provides methods, scripts, and support information for troubleshooting common Watson Studio Local problems.

Symptom: When you execute conda install <> inside the GPU pods for a custom image, the pods remain stuck in a running state and don't come back to a ready state. The environment then fails to come up.

This issue occurs because running conda install <> updates the Python version, which then creates a conflict with conda. As a result, the Jupyter server does not start.

To work around the issue:

  1. Stop and restart the environment.
  2. Run the conda install command and pin the Python version, for example conda install gensim python=3.6.8 (see the example after this list). Pinning the version prevents Python from being upgraded and allows the Jupyter server to start.
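
For example, a minimal sketch of the workaround run from the environment's terminal; gensim and 3.6.8 are only examples, so substitute the package you need and the Python version that is already present in the image:

# Pin the Python version so that conda does not upgrade it while installing the package.
conda install gensim python=3.6.8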

Symptom: Jupyter instances running inside Watson Studio Local were discoverable by piecing together the user ID and other information. Any authenticated user could then interact with that Jupyter instance by following the constructed URL.

To fix this problem, apply Patch 07.

Symptom: The api/v1/usermgmt/v1/usermgmt/usersbystatus endpoint returns a password hash value

To fix this problem, apply Patch 07.

Symptom: An asset runs fine within a Watson Studio project but returns a FileNotFoundError exception when it runs inside a project release in Watson Machine Learning.

This problem can happen if the directory mentioned in the exception is empty in the project. Git does not track empty directories, so the directory is not committed and pushed and is therefore missing from the project release. To fix the issue, create a .keep file in the directory within the project, and then commit and push.
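
For example, from a terminal in the project; the data/output path is hypothetical, so use the directory named in the exception:

# Git does not track empty directories, so add a placeholder file and push it.
touch data/output/.keep
git add data/output/.keep
git commit -m "Keep empty directory for project release"
git push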

Symptom: An "Invalid Access Token” error is returned when a Github/Bitbucket access token contains a “/” or other special characters.

To fix this problem, apply Patch 06.

Symptom: Updating the tag for a project release that is created from a GitHub/Bitbucket repo causes your system to hang.

To fix this problem, apply Patch 06.

Symptom: The following error occurs when you use the graphviz package within a Jupyter notebook

FileNotFoundError: [Errno 2] No such file or directory: 'dot'
ExecutableNotFound: failed to execute ['dot', '-Tpng'], make sure the Graphviz executables are on your systems' PATH

To fix this problem for Jupyter 2.7 and Jupyter 3.5 environments, apply Patch 06.

Symptom: When certain operations within a project fail, user-sensitive information is logged in the error message.

To fix this problem, apply Patch 06.

Symptom: One of the Docker images includes scripts that contain hardcoded credentials.

To fix this problem, apply Patch 06.

Symptom: The startup scripts of certain user pods run copy operations as root and can lead to a security vulnerability.

To fix this problem, apply Patch 06.

Symptom: An authentication error is returned when the user name that is provided when a token is created for GitHub/Bitbucket access contains ".", "@", or "\".

To fix this problem, apply Patch 06.

Symptom: For Watson Studio Local 1.2.3.1 x86, from User settings, LDAP users could change their password, which could corrupt a user's LDAP profile.

To fix this problem, apply patch05. The change password option is now removed for LDAP users after you apply the patch.

Symptom: The key file in a user's home directory is used to encrypt the user's credentials. In Watson Studio Local 1.2.3.1 x86, this file had read permissions for all users, causing a security concern.

To fix this problem, apply patch05. After you apply the patch, the home directory of newly created users is readable only by that user. The patch does not change the permissions of existing users' home directories; change those manually.
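
For existing users, a hedged example of tightening the permissions manually; the /user-home/<uid> path is an assumption, so substitute the actual location of the user's home directory on your cluster:

# Make the user's home directory (and the key file inside it) readable only by the owning user.
chmod 700 /user-home/<uid>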

Symptom: For Watson Studio Local 1.2.3.1 x86, on the Environments tab in the Projects page, the link to the terminal for Jupyter with Python 3.5 for GPU opens the incorrect terminal.

To fix this problem, apply patch05. The link is fixed after you apply the patch.

Symptom: For Watson Studio Local 1.2.3.1 x86, in the Projects page, within the Assets tab under Notebooks, the link to the Environments always opens the Jupyter with Python 2.7, Scala 2.11, R 3.4.3, Spark 2.0.2 environment.

To fix this problem, apply patch05. The links to open the appropriate environment are fixed after you apply the patch.

Symptom: For Watson Studio Local 1.2.3.1 x86, on the home page and in the Helpful links section, the link to the Docs points to an older version of the documentation.

To fix this problem, apply patch05. The links to open the 1.2.3 documentation are fixed after you apply the patch.

Symptom: For Watson Studio Local 1.2.3.1 x86, when you start a notebook within the GPU environment, the kernel fails to start up

Apply patch03 that is located here, and then select wsl-x86-v1231-patch03-TS002286373 to correct this issue.

Symptom: Starting a notebook with a Jupyter Python GPU environment fails with a TensorFlow error related to AVX, and there is a core dump during pod startup

This indicates that the CPU does not support the AVX instruction set, which is required by the TensorFlow version in the GPU image. In a VM environment, this might happen if one of the machines has an old generation processor that doesn't support the AVX instruction set.

To fix the problem, contact the system administrator to inspect the flags in the /proc/cpuinfo file and confirm that each machine has a CPU that supports the AVX instruction set.
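
For example, a quick check that can be run on each node:

# Look for the avx flag in /proc/cpuinfo; no match means the CPU cannot run this TensorFlow build.
grep -q avx /proc/cpuinfo && echo "AVX supported" || echo "AVX not supported"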

Symptom: CUDA_ERROR_OUT_OF_MEMORY in Jupyter environments in Watson Studio Local 1.2.3.1 on POWER9 with NVIDIA GPUs, even though nvidia-smi output shows enough free memory

Executing the following TensorFlow code in a Watson Studio Local 1.2.3.1 Jupyter GPU environment with PowerAI 1.5.4 or 1.6 produces a stack trace with the following lines:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
print("GPU Available: ", tf.test.is_gpu_available()) 
The stack trace contains a line like this:
InternalError: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 33822867456
Note: The total memory reported is about 33 GB, but running !nvidia-smi shows that there is enough free memory available.

The problem arises because the kernel must bring all available memory and cores online before applications such as Kubernetes, Docker, or other container platforms start up. Docker, Kubernetes, OpenShift, and other container platforms use cgroups to determine which CPU nodes and memory are available for use.

Because GPU memory is coherent with CPU memory on an AC922, the GPUs are added as memory nodes to the cpuset's cpuset.mems file. However, the GPUs take additional time to come online, sometimes as much as 5 minutes after boot, so cpuset.mems is updated a second time after the GPUs are online. In the meantime, Kubernetes and other container platforms create a copy of the cpusets and are not notified when nodes are brought online or offline.

This leads to a problem when CUDA tries to initialize memory on a GPU: because the cpuset copy does not include the GPU memory nodes, errors such as cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY show up. This is documented in this Red Hat Bugzilla.

On a POWER9 host machine with 2 CPUs and 4 GPUs, the following command shows something other than 0,8,252-255:
cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems

To fix this problem, run this script to diagnose and correct the issue: cpuset_check.sh.
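
For reference, a minimal sketch of the kind of check and correction such a script performs, assuming the kubepods cgroup path shown above; verify the paths and values on your system before writing to cgroup files:

# Compare the memory nodes the kernel currently reports with the ones the kubepods
# cpuset was created with; if they differ (for example, because the GPUs came online
# after Kubernetes started), copy the full set into the kubepods slice.
ROOT_MEMS=$(cat /sys/fs/cgroup/cpuset/cpuset.mems)
KUBE_MEMS=$(cat /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems)
if [ "$ROOT_MEMS" != "$KUBE_MEMS" ]; then
    echo "$ROOT_MEMS" > /sys/fs/cgroup/cpuset/kubepods.slice/cpuset.mems
fi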

Symptom: glusterd fails to start

The following failure is caused by a missing dependency on RHEL 7.4 and 7.5:

systemctl status glusterd
   glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

On the first master node, you can use the journalctl command to examine the failure to start:

journalctl -u glusterd
-- Logs begin at Wed 2019-03-27 07:28:31 PDT, end at Wed 2019-03-27 07:50:44 PDT. --
Mar 27 07:28:42 Dependency failed for GlusterFS, a clustered file-system server.
Mar 27 07:28:42 Job glusterd.service/start failed with result 'dependency'.
Mar 27 07:50:34 Dependency failed for GlusterFS, a clustered file-system server.
Mar 27 07:50:34 Job glusterd.service/start failed with result 'dependency'.

Further examination shows that the rpcbind.socket is failing:

systemctl status rpcbind
  rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: inactive (dead)

To fix this failure, update the /usr/lib/systemd/system/rpcbind.socket file as follows:

[Unit]
Description=RPCbind Server Activation Socket

[Socket]
ListenStream=/var/run/rpcbind.sock
ListenStream=[::]:111           <==== Remove
ListenStream=0.0.0.0:111       
BindIPv6Only=ipv6-only          <==== Remove

[Install]
WantedBy=sockets.target
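
After you remove the two marked lines, the file should contain only the following:

[Unit]
Description=RPCbind Server Activation Socket

[Socket]
ListenStream=/var/run/rpcbind.sock
ListenStream=0.0.0.0:111

[Install]
WantedBy=sockets.target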

Save the file and reload the daemon:

systemctl daemon-reload

Restart the rpcbind service:

systemctl start rpcbind

Check the status:

systemctl status rpcbind
   rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: active (running) since Wed 2019-03-27 07:55:36 PDT; 4s ago
  Process: 14276 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 14277 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─14277 /sbin/rpcbind -w

Mar 27 07:55:36 Starting RPC bind service...
Mar 27 07:55:36 Started RPC bind service.

Now glusterd can start. Check the status to verify that glusterd is now working:

systemctl start glusterd

systemctl status glusterd
   glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-03-27 07:55:50 PDT; 5s ago
  Process: 14346 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 14347 (glusterd)
   CGroup: /system.slice/glusterd.service
           └─14347 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Mar 27 07:55:50 Starting GlusterFS, a clustered file-system server...
Mar 27 07:55:50 Started GlusterFS, a clustered file-system server.

Symptom: Watson Studio Local does not start, and pods are stuck creating or crashing, because GlusterFS is read-only or offline

Symptom: kubectl get no returns the following: The connection to the server localhost:8080 was refused - did you specify the right host or port?

  1. Verify that at least two master nodes are online.
  2. Check the status of docker: systemctl status docker.
  3. Check the status of kubelet: systemctl status kubelet.
  4. Check that the kube-apiserver is running: docker ps | grep kube-apiserver
  5. If it is not shown, run docker ps -a | grep kube-apiserver, and then check the logs: docker logs <api server container id>
  6. If the logs look fine, verify that there is adequate space in the installer partition: df -lh
  7. Check the logs of the etcd container, and verify that you don't see messages about the time being out of sync (see the sketch after this list).
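
A hedged example of step 7; the grep pattern is an assumption about how clock-drift warnings appear in the etcd logs:

# Find the etcd container and scan its logs for messages about clock differences
# between cluster members.
ETCD_ID=$(docker ps | grep etcd | awk '{print $1}' | head -1)
docker logs "$ETCD_ID" 2>&1 | grep -i clock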

Symptom: ibm-nginx pods are in CrashLoopBackOff

  1. Check the ibm-nginx pod logs for failures to resolve or reach kubedns.kube-system.svc.cluster.local or usermgmt-svc.dsx.svc.cluster.local.
  2. Check the logs of the kubedns pods and restart kubedns if needed (see the sketch after this list).
  3. On the host, check the kubedns: nslookup kubernetes.default.svc.cluster.local 10.0.0.4
  4. Exec into the dsx-core pod, and check DNS resolution in the pod:
    kubectl exec -it -n dsx <dsx-core-pod-id> sh
    nslookup kubernetes.default.svc.cluster.local
  5. If resolving names doesn’t work, check /etc/resolv.conf and confirm that outside DNS lookups work: nslookup <hostname on network>
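
A hedged example of step 2; the pod and container names are assumptions, so substitute the names that kubectl reports on your cluster:

# List the DNS pods, inspect their logs, and delete a pod so that it is recreated.
kubectl -n kube-system get pods | grep dns
kubectl -n kube-system logs <kube-dns-pod-id> -c kubedns
kubectl -n kube-system delete pod <kube-dns-pod-id>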

If the previous steps do not resolve the issue, running iptables -F on each node and then rebooting might help.

Symptom: Failed to pull image from docker registry

If the service startup log shows pods stuck for a long time with the status ImagePullBackOff, ErrImagePull, or ContainerCreating, then the Docker registry, which also runs as a pod on the system, might be down or have a problem.

  1. Verify that the gluster volume is online and not read-only.
  2. Verify that you can connect to the docker registry from the node: curl -k https://9.87.654.321:31006/v2/_catalog
  3. The command should show the catalog of images. If you see connection refused, restart the registry pod and check its logs (see the sketch after this list), check whether any firewalls are running, and check whether SELinux is enabled:
    systemctl status firewalld
    systemctl status iptables
    sestatus
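
A hedged example of restarting the registry pod; the namespace and pod names are placeholders, so substitute the values that kubectl reports on your cluster:

# Locate the docker registry pod, inspect its logs, and delete it so that Kubernetes restarts it.
kubectl get pods --all-namespaces | grep registry
kubectl -n <registry-namespace> logs <registry-pod-id>
kubectl -n <registry-namespace> delete pod <registry-pod-id>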

Symptom: Connecting to Watson Studio Local from browser fails but ALL pods show as running

  1. Verify that you can ping the proxy IP or hostname.
  2. Connect to one of the nodes and attempt to connect to the site internally.
  3. Verify the site IP:
    kubectl get svc --all-namespaces | grep ibm-nginx

    dsx           ibm-nginx-svc     ClusterIP   10.9.87.654     9.87.654.321   443/TCP   40d

    The second IP address (or addresses) is where the site can be accessed from:

    curl -k https://<ip of proxy or ip of one of the masters if lb used>/auth/login/login.html

    The output should be similar to the following:

    <!DOCTYPE html>
    <html>
            <head>
           <title>Data Science Experience Local</title>
           <link rel="stylesheet" type="text/css"
    href="ap-components-react.min.css">
           <link rel="stylesheet" type="text/css"
    href="DSXLogin.css">
           <link rel="stylesheet" type="text/css" href="nav.css">
       </head>
       <body>
           <div id='loginComponent'></div>
           <script src='login.js' ></script>
       </body>  
    </html>

    If this page comes up, Watson Studio Local works internally.

  4. Check if firewalls are enabled:
    systemctl status firewalld
    systemctl status iptables

    Verify that there are no firewalls, proxies, or blocked port 443 between the system from which you're trying to access the site and the IP of Watson Studio Local.

Symptom: Admin console fails to list images with status code 404

This 404 error means that something went wrong with the image management preparation job. To get the log of the preparation job:

  1. Get the pod name for the preparation job:
    # kubectl -n dsx get po -a | grep imagemgmt-preparation-job
    imagemgmt-preparation-job-<random>                               0/1       Completed   0          22d
  2. Print the log:
    kubectl -n dsx logs imagemgmt-preparation-job-<random>

Most likely some steps failed, and the last step to expose the nginx endpoint was not run.

To fix this manually, go into a running image-mgmt pod:

kubectl -n dsx exec -it $(kubectl -n dsx get po | grep image | grep Running | awk '{print $1}' | head -1) sh

Then manually run the scripts for the image preparation job:

cd /scripts; ./retag_images.sh; node ./builtin-image-info.js; ./update_nginx.sh; /user-home/.scripts/system/utils/nginx