Troubleshooting common Watson Studio Local problems

The following section provides various methods, scripts, and support for troubleshooting common Watson Studio Local problems.

Symptom: An asset runs fine within a Watson Studio project, but returns a FileNotFoundError exception when it runs inside a project release in Watson Machine Learning.

This problem can happen if the directory mentioned in the exception is empty in the project. Git does not track empty directories, so the directory is not committed and pushed and is therefore missing from the project release. To fix the issue, create a .keep file in the directory within the project, then commit and push.
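
For example, assuming the empty directory named in the exception is data (a hypothetical name; substitute the actual directory), the following commands, run from a terminal in the project, create the placeholder file and push it:

touch data/.keep
git add data/.keep
git commit -m "Add .keep so the empty data directory is tracked"
git push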

Symptom: An "Invalid Access Token” error is returned when a Github/Bitbucket access token contains a “/” or other special characters.

To fix this problem, apply Patch 06.

Symptom: Updating the tag for a project release that is created from a Github/Bitbucket repo causes your system to hang.

To fix this problem, apply Patch 06.

Symptom: The following error occurs when you use the graphviz package within a Jupyter notebook:

FileNotFoundError: [Errno 2] No such file or directory: 'dot'
ExecutableNotFound: failed to execute ['dot', '-Tpng'], make sure the Graphviz executables are on your systems' PATH

To fix this problem for Jupyter 2.7 and Jupyter 3.5 environments, apply Patch 06.

Symptom: When certain operations within a project fail, user-sensitive information is logged in the error message.

To fix this problem, apply Patch 06.

Symptom: One of the docker images includes scripts that contain hardcoded credentials.

To fix this problem, apply Patch 06.

Symptom: The startup scripts of certain user pods run copy operations as root, which can lead to a security vulnerability.

To fix this problem, apply Patch 06.

Symptom: An authentication error is returned when the user name that is provided when a token is created for Github/Bitbucket access contains “.”, “@”, or “\”.

To fix this problem, apply Patch 06.

Symptom: For Watson Studio Local 1.2.3.1 x86, from User settings, LDAP users could change their password, which could corrupt a user's LDAP profile.

To fix this problem, apply patch05. The change password option is now removed for LDAP users after you apply the patch.

Symptom: The key file in a user's home directory is used to encrypt the user's credentials. In Watson Studio Local 1.2.3.1 x86, this file had read permissions for all users, causing a security concern.

To fix this problem, apply patch05. After you apply the patch, the home directory of a newly created user is readable only by that user. The patch does not change the home directory permissions of existing users; change these manually.
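
For example, assuming an existing user's home directory is mounted at /user-home/<uid> (the path can differ in your installation), an administrator can restrict it to owner-only access:

chmod 700 /user-home/<uid>
ls -ld /user-home/<uid>    # verify that group and other permissions are removed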

Symptom: For Watson Studio Local 1.2.3.1 x86, on the Environments tab in the Projects page, the link to the terminal for Jupyter with Python 3.5 for GPU opens the incorrect terminal.

To fix this problem, apply patch05. The link is fixed after you apply the patch.

Symptom: For Watson Studio Local 1.2.3.1 x86, in the Projects page, within the Assets tab under Notebooks, the link to the Environments always opens the Jupyter with Python 2.7, Scala 2.11, R 3.4.3, Spark 2.0.2 environment.

To fix this problem, apply patch05. The links to open the appropriate environment are fixed after you apply the patch.

Symptom: For Watson Studio Local 1.2.3.1 x86, on the home page and in the Helpful links section, the link to the Docs points to an older version of the documentation.

To fix this problem, apply patch05. The links to open the 1.2.3 documentation are fixed after you apply the patch.

Symptom: For Watson Studio Local 1.2.3.1 x86, when you start a notebook within the GPU environment, the kernel fails to start up

To correct this issue, apply patch03: from the patch download location, select wsl-x86-v1231-patch03-TS002286373.

Symptom: Starting a notebook with the Jupyter Python GPU environment fails with a TensorFlow error related to AVX, and there is a core dump during the pod startup

This indicates that the CPU does not support the AVX instruction set, which is required by the TensorFlow version in the GPU image. In a VM environment, this might happen if one of the machines has an old generation processor that doesn't support the AVX instruction set.

To fix the problem, contact the system administrator to inspect the flags in the /proc/cpuinfo file and ensure that the machines have CPUs that support the AVX instruction set.
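
For example, the following check, run on each node, reports whether the CPU advertises the avx flag:

grep -qw avx /proc/cpuinfo && echo "AVX supported" || echo "AVX not supported"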

Symptom: glusterd fails to start

The following failure is caused by a missing dependency on RHEL 7.4 and 7.5:

systemctl status glusterd
   glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: inactive (dead)

On the first master node, you can use the journalctl command to examine the failure to start:

journalctl -u glusterd
-- Logs begin at Wed 2019-03-27 07:28:31 PDT, end at Wed 2019-03-27 07:50:44 PDT. --
Mar 27 07:28:42 Dependency failed for GlusterFS, a clustered file-system server.
Mar 27 07:28:42 Job glusterd.service/start failed with result 'dependency'.
Mar 27 07:50:34 Dependency failed for GlusterFS, a clustered file-system server.
Mar 27 07:50:34 Job glusterd.service/start failed with result 'dependency'.

Further examination shows that the rpcbind.socket is failing:

systemctl status rpcbind
  rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: inactive (dead)

To fix this failure, update the /usr/lib/systemd/system/rpcbind.socket file as follows:

[Unit]
Description=RPCbind Server Activation Socket

[Socket]
ListenStream=/var/run/rpcbind.sock
ListenStream=[::]:111           <==== Remove
ListenStream=0.0.0.0:111       
BindIPv6Only=ipv6-only          <==== Remove

[Install]
WantedBy=sockets.target
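
One way to make this edit non-interactively is shown in the following sketch; back up the file first and verify the result before reloading:

cp /usr/lib/systemd/system/rpcbind.socket /usr/lib/systemd/system/rpcbind.socket.bak
sed -i -e '/ListenStream=\[::\]:111/d' -e '/BindIPv6Only=ipv6-only/d' /usr/lib/systemd/system/rpcbind.socket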

Save the file and reload the daemon:

systemctl daemon-reload

Restart the rpcbind service:

systemctl start rpcbind

Check the status:

systemctl status rpcbind
   rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: active (running) since Wed 2019-03-27 07:55:36 PDT; 4s ago
  Process: 14276 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 14277 (rpcbind)
   CGroup: /system.slice/rpcbind.service
           └─14277 /sbin/rpcbind -w

Mar 27 07:55:36 Starting RPC bind service...
Mar 27 07:55:36 Started RPC bind service.

Now glusterd can start. Check the status to verify that glusterd is now working:

systemctl start glusterd

systemctl status glusterd
   glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-03-27 07:55:50 PDT; 5s ago
  Process: 14346 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 14347 (glusterd)
   CGroup: /system.slice/glusterd.service
           └─14347 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO

Mar 27 07:55:50 Starting GlusterFS, a clustered file-system server...
Mar 27 07:55:50 Started GlusterFS, a clustered file-system server.

Symptom: Watson Studio Local does not start, and pods are stuck creating or crashing, because glusterfs is read-only or offline

Symptom: kubectl get no returns the following error: The connection to the server localhost:8080 was refused - did you specify the right host or port?

  1. Verify that at least two master nodes are online.
  2. Check the status of docker: systemctl status docker.
  3. Check the status of kubelet: systemctl status kubelet.
  4. Check that the kube-apiserver is running: docker ps | grep kube-apiserver
  5. If it is not listed, check for a stopped container: docker ps -a | grep kube-apiserver. Then check its logs: docker logs <api server container id>
  6. If the logs look fine, verify that there is adequate space in the installer partition: df -lh
  7. Check the logs of the etcd container, and verify that you don't see messages about the time being out of sync.
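
The checks in steps 2 through 7 can be combined into a quick pass on a master node, for example (a sketch; adjust as needed):

systemctl is-active docker kubelet                # both should report "active"
docker ps | grep kube-apiserver                   # should show a running container
docker ps -a | grep kube-apiserver                # if not running, find the exited container here
df -lh                                            # confirm adequate free space in the installer partition
docker logs $(docker ps | grep etcd | awk '{print $1}' | head -1) 2>&1 | tail -50    # look for messages about the time being out of sync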

Symptom: ibm-nginx pods CrashLoopBack

  1. Check the ibm-nginx pod logs for errors that mention kubedns.kube-system.svc.cluster.local or usermgmt-svc.dsx.svc.cluster.local.
  2. Check the logs of the kubedns pods, or restart kubedns.
  3. On the host, check that kubedns responds: nslookup kubernetes.default.svc.cluster.local 10.0.0.4
  4. Exec into the dsx-core pod, and check DNS resolution in the pod:
    kubectl exec -it -n dsx <dsx-core-pod-id> sh
    nslookup kubernetes.default.svc.cluster.local
  5. If resolving names doesn't work, check /etc/resolv.conf and confirm that outside DNS lookups work: nslookup <hostname on network>
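
For step 2, a sketch of inspecting and restarting the kubedns pods (pod and container names can differ on your cluster):

kubectl -n kube-system get pods | grep dns
kubectl -n kube-system logs <kubedns-pod-id> -c kubedns
kubectl -n kube-system delete pod <kubedns-pod-id>    # the pod is recreated by its controller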

Another tip that might resolve this issue: run iptables -F on each node and then reboot.

Symptom: Failed to pull image from docker registry

If the service startup log shows pods with status "ImagePullBackOff", "ErrImagePull", or "ContainerCreating" for a long time, the docker registry, which also runs as a pod on the system, might be down or have a problem.

  1. Verify that the gluster volume is online and not read-only.
  2. Verify that you can connect to the docker registry from the node: curl -k https://9.87.654.321:31006/v2/_catalog
  3. The curl command in step 2 shows the catalog of images. If you see connection refused, restart the registry pod and check its logs, check whether any firewalls are running, and check whether SELinux is enabled:
    systemctl status firewalld
    systemctl status iptables
    sestatus
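
For step 3, a sketch of locating and restarting the registry pod (the namespace and pod name can differ in your installation):

kubectl get pods --all-namespaces | grep registry
kubectl -n <namespace> logs <registry-pod-id>
kubectl -n <namespace> delete pod <registry-pod-id>    # the pod is recreated by its controller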

Symptom: Connecting to Watson Studio Local from browser fails but ALL pods show as running

  1. Verify that you can ping the proxy IP or hostname.
  2. Connect to one of the nodes and attempt to connect to the site internally.
  3. Verify site IP:
    kubectl get svc --all-namespaces | grep ibm-nginx

    dsx           ibm-nginx-svc        ClusterIP   10.9.87.654     9.87.654.321   443/TCP   40d

    The second IP (or IPs) is where the site can be accessed from:

    curl -k https://<ip of proxy or ip of one of the masters if lb used>/auth/login/login.html

    Should show something similar to this:

    <!DOCTYPE html>
    <html>
            <head>
           <title>Data Science Experience Local</title>
           <link rel="stylesheet" type="text/css"
    href="ap-components-react.min.css">
           <link rel="stylesheet" type="text/css"
    href="DSXLogin.css">
           <link rel="stylesheet" type="text/css" href="nav.css">
       </head>
       <body>
           <div id='loginComponent'></div>
           <script src='login.js' ></script>
       </body>  
    </html>

    If this page comes up, Watson Studio Local works internally.

  4. Check if firewalls are enabled:
    systemctl status firewalld
    systemctl status iptables

    Verify that there are no firewalls, proxies, or blocked ports (443) between the system from which you're trying to access the site and the IP of Watson Studio Local.
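
    You can also confirm from the client machine that port 443 on the proxy is reachable, for example:

    curl -kv --connect-timeout 5 https://<ip of proxy or hostname>/auth/login/login.html

    If this fails from the client but the internal check in step 3 succeeded, a firewall or proxy in between is blocking port 443.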

Symptom: Admin console fails to list images with status code 404

This 404 error means that something went wrong with the image management preparation job. To get the log of the preparation job:

  1. Get the pod name for the preparation job:
    # kubectl -n dsx get po -a | grep imagemgmt-preparation-job
    imagemgmt-preparation-job-<random>                               0/1       Completed   0          22d
  2. Print the log:
    kubectl -n dsx logs imagemgmt-preparation-job-<random>

Most likely some steps failed, and the last step to expose the nginx endpoint was not run.

To fix this manually, go into a running image-mgmt pod:

kubectl -n dsx exec -it $(kubectl -n dsx get po | grep image | grep Running | awk '{print $1}' | head -1) sh

Then manually run the scripts for the image preparation job:

cd /scripts; ./retag_images.sh; node ./builtin-image-info.js; ./update_nginx.sh; /user-home/.scripts/system/utils/nginx