
Troubleshoot Watson Studio Local

Watson Studio Local provides various methods, scripts, and support for troubleshooting your cluster.

Troubleshoot your system performance

If computing jobs slow down, or the system stalls to the point where no new projects or assets can be created, reserve more CPU and memory for your runtime on the Environments page. You can also stop the runtime environments of idle notebooks to free up resources.

Troubleshoot your system with administration utilities

To automatically troubleshoot and repair common problems in your Watson Studio Local cluster, run the /wdp/utils/ command. Enter -h for a help page.

You can also troubleshoot your entire Watson Studio Local cluster with the /wdp/utils/ script by entering the following command: --user username --password --port portnumber --key-pair ssh_key_pair_file

where username represents your username for the cluster, --password prompts for the password, portnumber represents the SSH port for the cluster (default is 22 for all nodes), and ssh_key_pair_file represents an optional private SSH key file.

The admin-utils command performs the following tasks:

  • Checks whether all nodes are up and accessible with SSH
  • Checks free disk space on all nodes
  • Checks whether Docker and kubelet are running
  • Checks whether Gluster is installed and the volumes are up

If all of the checks pass, the script automatically logs everything that is not in the Ready or Running state to the following compressed file: /wdp/utils/admin-utils-timestamp.tar.gz. Send this file to IBM support.
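For a quick manual look at one of the checks listed above, free disk space, the following sketch may help. It is not part of the admin-utils script and does not reproduce its logic; the function name full_filesystems and the default threshold of 90 percent are assumptions made for illustration. It reads POSIX-format df output and prints every mount point at or above the threshold.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Watson Studio Local).
# Reads `df -P` output on stdin and prints "<mount point> <capacity>"
# for every file system whose usage is at or above the threshold.
# Mount points that contain spaces are not handled.
full_filesystems() {
    threshold="${1:-90}"    # assumed default; pick a value that suits your cluster
    awk -v limit="$threshold" 'NR > 1 {
        use = $5                 # Capacity column, for example "95%"
        sub(/%/, "", use)        # strip the percent sign
        if (use + 0 >= limit) print $6 " " $5
    }'
}

# Example (run on each node):
#   df -P | full_filesystems 90
```

An empty result means that no file system on that node has reached the threshold.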

Restart the Watson Studio Local cluster

If you need to reboot the entire Watson Studio Local cluster, start the nodes back up in the following order.

For a seven-node cluster:

  1. First control/storage node
  2. Second control/storage node
  3. Third control/storage node
  4. Compute nodes in any order
  5. Deployment nodes in any order

To reboot a four-node cluster:

  1. Run the df -h /var command on all nodes and make sure that /var is not 100% full on any of them.
  2. After rebooting a node and before rebooting the next one, enter the following command to ensure all pods are running: kubectl get po --all-namespaces | grep -v Running. Repeat the command until all pods are running.

Tip: If your GPUs disappear from Watson Studio Local after the POWER system reboots (the /dev/nvidia-uvm device file no longer appears on the system and kubernetes describe node shows 0), rerun cudaInit_ppc64le to recreate /dev/nvidia-uvm, and then restart kubelet.
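
The "repeat the command until all pods are running" part of step 2 can be sketched as a small polling loop. This is only an illustration: the function names pods_not_running and wait_for_pods and the 10-second default interval are assumptions, not part of the product. Unlike grep -v Running, the filter skips the header line and compares only the STATUS column.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Watson Studio Local).

# Reads `kubectl get po --all-namespaces` output on stdin and prints
# "<namespace>/<pod> <status>" for each pod whose STATUS is not Running.
pods_not_running() {
    awk 'NR > 1 && $4 != "Running" { print $1 "/" $2 " " $4 }'
}

# Polls until every pod reports Running.
# $1 is the polling interval in seconds (assumed default: 10).
wait_for_pods() {
    interval="${1:-10}"
    while :; do
        # Give up if kubectl itself fails, for example while the API server restarts.
        listing=$(kubectl get po --all-namespaces) || return 1
        pending=$(printf '%s\n' "$listing" | pods_not_running)
        if [ -z "$pending" ]; then
            return 0
        fi
        printf '%s\n' "$pending"
        sleep "$interval"
    done
}

# Example (run after a node has rebooted, before rebooting the next one):
#   wait_for_pods 15
```

Pods that belong to finished jobs report Completed rather than Running, so they keep the loop going, just as they keep appearing in the grep -v Running output. Interrupt the loop manually if only such pods remain.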

Disable IPv6

If you disable Internet Protocol version 6 (IPv6) on RHEL 7.3 or 7.4, complete the following steps to avoid the symptoms described in the Kubernetes Troubleshooting FAQ. One example of such a symptom on RHEL 7.3 is that the rpcbind service fails to start, which in turn prevents glusterd from starting.

  1. In the /etc/sysctl.conf file, add the following lines:
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
  2. Enter the following command: dracut -f
  3. Restart Watson Studio Local.
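
Before you run dracut -f, you might want to confirm that both settings from step 1 are present. The following is a minimal sketch; the function name ipv6_disabled_in is an assumption, and the check only inspects the file that you pass to it, not the running kernel state.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Watson Studio Local).
# Returns 0 if the given file sets both IPv6 disable keys to 1,
# and a nonzero status otherwise. Commented-out lines do not count.
ipv6_disabled_in() {
    conf="${1:-/etc/sysctl.conf}"
    for key in net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.disable_ipv6; do
        awk -F= -v key="$key" '
            {
                k = $1; v = $2
                gsub(/[[:space:]]/, "", k)   # tolerate spaces around the key
                gsub(/[[:space:]]/, "", v)   # and around the value
            }
            k == key && v == "1" { found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$conf" || return 1
    done
}

# Example:
#   ipv6_disabled_in /etc/sysctl.conf && echo "both IPv6 settings are present"
```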