Troubleshoot common Watson Studio Local problems

The following section provides various methods, scripts, and support for troubleshooting common Watson Studio Local problems.

Symptom: Watson Studio Local does not start, with pods stuck creating or crashing, because glusterfs is read-only or offline

  1. Verify that the gluster bricks are online:
    kubectl get po --all-namespaces | grep gluster

    The pod should show as Running:

    kube-system   glusterfs-dldrn                   1/1       Running   1          40d

    Exec into the pod and check the peer and brick status:

    kubectl exec -it -n kube-system glusterfs-dldrn /bin/bash
    gluster peer status

    The command should show two peers in the Connected state:

    Number of Peers: 2
    Hostname: 9.87.654.321
    Uuid: dae...6f2
    State: Peer in Cluster (Connected)
  2. Enter gluster volume status. Verify that you see three bricks, that each brick has a TCP port assigned, and that each shows Y in the Online column:
    Status of volume: dsx-cloudant

    Gluster process                             TCP Port  RDMA Port  Online  Pid
    ------------------------------------------------------------------------------
    Brick 9.87.654.321:/data/dsx-cloudant       49152     0          Y       331
  3. If you do not, run the following to attempt to correct each volume that is not online: gluster volume stop <volume name>.
  4. Select Y to stop the volume.
  5. Enter gluster volume start <volume name>.
  6. Verify with gluster volume status. Once all volumes show their bricks online, Watson Studio Local should recover. (A scripted version of steps 3 through 6 is sketched after this list.)
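For clusters with several volumes, steps 3 through 6 can be scripted. The following is a minimal sketch, run inside the glusterfs pod, that restarts every volume reporting an offline brick. It assumes the gluster CLI output format shown above; review which volumes it will touch before running it in production.

# Sketch only: restart gluster volumes that report an offline brick.
# Run inside the glusterfs pod. Confirms each stop with "y" automatically.
for vol in $(gluster volume list); do
    # Brick lines show the Online flag (Y or N) in the next-to-last column
    if gluster volume status "$vol" | awk '/^Brick/ && $(NF-1) == "N" { bad=1 } END { exit !bad }'; then
        echo "Restarting offline volume: $vol"
        echo y | gluster volume stop "$vol"
        gluster volume start "$vol"
    fi
done
gluster volume status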

Symptom: kubectl get no returns the following: The connection to the server localhost:8080 was refused - did you specify the right host or port?

  1. Verify that at least two master nodes are online.
  2. Check the status of docker: systemctl status docker.
  3. Check the status of kubelet: systemctl status kubelet.
  4. Check that the kube-apiserver is running: docker ps | grep kube-apiserver
  5. If it is not shown, find the stopped container with docker ps -a | grep kube-apiserver, then check its logs: docker logs <api server container id>
  6. If the logs look fine, verify that there is adequate space in the installer partition: df -lh
  7. Check the logs for the etcd container and verify that you do not see messages about the time being out of sync. (A sketch that combines checks 2 through 7 follows this list.)
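The checks above can be run in one pass on a master node with a short script such as this sketch. The installer partition path is an assumption; substitute the path used by your installation.

#!/bin/bash
# Sketch only: basic master-node health checks for Watson Studio Local.
echo "== docker ==";  systemctl is-active docker
echo "== kubelet =="; systemctl is-active kubelet

# kube-apiserver should appear in the running container list
if ! docker ps | grep -q kube-apiserver; then
    echo "kube-apiserver not running; last 50 log lines:"
    cid=$(docker ps -a | grep kube-apiserver | awk '{print $1}' | head -1)
    [ -n "$cid" ] && docker logs --tail 50 "$cid"
fi

# Adequate free space in the installer partition (path is an assumption)
df -lh /ibm

# etcd logs should not mention clocks being out of sync
ecid=$(docker ps | grep etcd | awk '{print $1}' | head -1)
[ -n "$ecid" ] && docker logs --tail 200 "$ecid" 2>&1 | grep -i "clock\|out of sync"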

Symptom: ibm-nginx pods in CrashLoopBackOff

  1. Check the ibm-nginx pod logs for errors resolving kubedns.kube-system.svc.cluster.local or usermgmt-svc.dsx.svc.cluster.local.
  2. Check the logs of the kubedns pods and restart kubedns if necessary.
  3. On the host, query kubedns directly: nslookup kubernetes.default.svc.cluster.local 10.0.0.4
  4. Exec into the dsx-core pod and check DNS resolution in the pod:
    kubectl exec -it -n dsx <dsx-core-pod-id> sh
    nslookup kubernetes.default.svc.cluster.local
  5. If name resolution fails inside the pod, check /etc/resolv.conf and confirm that external DNS lookups work: nslookup <hostname on network> (a combined version of these checks is sketched below).

As a last resort, flushing the iptables rules sometimes resolves this issue: run iptables -F on each node and then reboot.
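The DNS checks in steps 3 through 5 can be combined into one script run from a node, as in this sketch. The DNS service IP 10.0.0.4 is taken from step 3 above, and ibm.com is just an example external hostname; substitute values that match your environment.

#!/bin/bash
# Sketch only: verify cluster DNS from the host and from inside a dsx pod.
DNS_IP=10.0.0.4    # cluster DNS service IP from step 3 above

echo "== host -> kubedns =="
nslookup kubernetes.default.svc.cluster.local "$DNS_IP"

echo "== in-pod resolution (dsx-core) =="
pod=$(kubectl -n dsx get po | grep dsx-core | grep Running | awk '{print $1}' | head -1)
kubectl -n dsx exec "$pod" -- nslookup kubernetes.default.svc.cluster.local

echo "== external DNS from the host =="
cat /etc/resolv.conf
nslookup ibm.com    # any hostname reachable on your network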

Symptom: Failed to pull image from docker registry

If the service startup log shows the status "ImagePullBackOff", "ErrImagePull", or "ContainerCreating" for a long time, the docker registry, which also runs as a pod on the system, might be down or have a problem.

  1. Verify that the gluster volume is online and not read-only.
  2. Verify that you can connect to the docker registry from the node: curl -k https://9.87.654.321:31006/v2/_catalog
  3. The command should return the catalog of images. If you see connection refused, restart the registry pod and check its logs; also check whether any firewalls are running and whether SELinux is enabled (a combined sketch follows this list):
    systemctl status firewalld
    systemctl status iptables
    sestatus
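A sketch that bundles these registry checks together (9.87.654.321:31006 is the placeholder registry address used in step 2; substitute your own):

#!/bin/bash
# Sketch only: check registry reachability and common blockers.
REGISTRY=9.87.654.321:31006    # placeholder; use your registry address

echo "== registry catalog =="
curl -k "https://$REGISTRY/v2/_catalog" || echo "registry unreachable"

echo "== registry pod status =="
kubectl get po --all-namespaces | grep -i registry

echo "== firewalls / selinux =="
systemctl is-active firewalld
systemctl is-active iptables
sestatus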

Symptom: Connecting to Watson Studio Local from browser fails but ALL pods show as running

  1. Verify that you can ping the proxy IP or hostname.
  2. Connect to one of the nodes and attempt to connect to the site internally.
  3. Verify the site IP:
    kubectl get svc --all-namespaces | grep ibm-nginx

    dsx           ibm-nginx-svc                    ClusterIP   10.9.87.654     9.87.654.321   443/TCP   40d

    The second IP (the external IP, 9.87.654.321 in this example) is where the site can be accessed:

    curl -k https://<ip of proxy or ip of one of the masters if lb used>/auth/login/login.html

    The command should return something similar to this:

    <!DOCTYPE html>
    <html>
            <head>
           <title>Data Science Experience Local</title>
           <link rel="stylesheet" type="text/css"
    href="ap-components-react.min.css">
           <link rel="stylesheet" type="text/css"
    href="DSXLogin.css">
           <link rel="stylesheet" type="text/css" href="nav.css">
       </head>
       <body>
           <div id='loginComponent'></div>
           <script src='login.js' ></script>
       </body>  
    </html>

    If this page comes up, Watson Studio Local works internally.

  4. Check if firewalls are enabled:
    systemctl status firewalld
    systemctl status iptables

    Verify that there are no firewalls, proxies, or blocks on port 443 anywhere on the path between the system you are connecting from and the Watson Studio Local IP (a quick client-side check is sketched below).
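From the client machine, a reachability check such as this sketch helps distinguish a blocked path from a down site (the IP is the placeholder from step 3):

#!/bin/bash
# Sketch only: test the path from a client to the Watson Studio Local proxy.
PROXY=9.87.654.321    # placeholder; use your proxy or master IP

ping -c 3 "$PROXY"

# A refused or timed-out connection here points to a firewall or proxy
# on the path, since the same URL works from inside the cluster.
curl -kv --connect-timeout 5 "https://$PROXY/auth/login/login.html" -o /dev/null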

Symptom: Admin console fails to list images with status code 404

This 404 error means that something went wrong with the image management preparation job. To get the log of the preparation job:

  1. Get the pod name for the preparation job:
    kubectl -n dsx get po -a | grep imagemgmt-preparation-job
    imagemgmt-preparation-job-<random>                               0/1       Completed   0          22d
  2. Print the log:
    kubectl -n dsx logs imagemgmt-preparation-job-<random>

Most likely some steps failed, and the last step to expose the nginx endpoint was not run.

To fix this manually, go into a running image-mgmt pod:

kubectl -n dsx exec -it $(kubectl -n dsx get po | grep image | grep Running | awk '{print $1}' | head -1) sh

Then manually run the scripts for the image preparation job:

cd /scripts; ./retag_images.sh; node ./builtin-image-info.js; ./update_nginx.sh; /user-home/.scripts/system/utils/nginx
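If you prefer to run the whole fix non-interactively from a node, the exec and the scripts can be wrapped in one command, as in this sketch (pod selection copied from above; the && chain stops at the first failing step so you can see where it broke):

#!/bin/bash
# Sketch only: run the image preparation steps inside a running image-mgmt pod.
pod=$(kubectl -n dsx get po | grep image | grep Running | awk '{print $1}' | head -1)

kubectl -n dsx exec "$pod" -- sh -c '
    cd /scripts &&
    ./retag_images.sh &&
    node ./builtin-image-info.js &&
    ./update_nginx.sh &&
    /user-home/.scripts/system/utils/nginx
'

Afterward, reload the admin console image list to confirm that the 404 is gone.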