Troubleshoot Watson Studio Local
Watson Studio Local provides various methods, scripts, and support for troubleshooting your cluster.
- Troubleshoot the installation
- Troubleshoot common problems
- Troubleshoot Watson Studio Local on ICP
- Troubleshoot your system performance
- Troubleshoot your system with administration utilities
- Restart the Watson Studio Local cluster
- Disable IPv6
Troubleshoot your system performance
If computing jobs slow down, or stall to the point where no new projects or assets can be created, reserve more CPU and memory for your runtime on the Environments page. You can also stop the runtime environment for idle notebooks to free up resources.
Troubleshoot your system with administration utilities
To automatically troubleshoot and repair common problems in your Watson Studio Local cluster, run the /wdp/utils/cluster_repair_utility.sh command. Enter cluster_repair_utility.sh -h for a help page.
You can also troubleshoot your entire Watson Studio Local cluster with the /wdp/utils/admin-utils.sh script by entering the following command:
admin-utils.sh --user username --password --port portnumber --key-pair ssh_key_pair_file
where username represents your username for the cluster, --password prompts for the password, portnumber represents the SSH port for the cluster (default is 22 for all nodes), and ssh_key_pair_file represents an optional private SSH key file.
The admin-utils command performs the following tasks:
- Checks whether all nodes are up and accessible with SSH
- Checks free disk space on all nodes
- Checks whether Docker and kubelet are running
- Checks whether Gluster is installed and the volumes are up
If all of the checks pass, then the script automatically logs everything that is not in the Running state into the following compressed file: /wdp/utils/admin-utils-timestamp.tar.gz. Send this file to IBM support.
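Some of the checks listed above can be approximated by hand on a single node, which may help when the script itself cannot be run. The following is a minimal sketch and not the logic of admin-utils.sh: the function names and the 90% threshold are hypothetical, and it assumes that Docker and kubelet run as systemd units named docker and kubelet.

```shell
#!/usr/bin/env bash
# Sketch only: spot-check one node by hand. This is NOT how admin-utils.sh
# works internally; function names and the threshold are hypothetical.

# Print the usage percentage (as a bare number) from `df -P` output on stdin.
# Reading stdin keeps the function testable without touching a real node.
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the usage read from stdin is below the limit (default 90).
disk_ok() {
  local limit="${1:-90}" pct
  pct="$(usage_pct)"
  [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}

# Print the state of a systemd unit (for example: active, inactive).
# Assumption: the services are managed by systemd on this node.
service_state() {
  systemctl is-active "$1" 2>/dev/null || true
}
```

On a node, this could be used as `df -P /var | disk_ok 90 || echo "/var is nearly full"`, followed by `service_state docker` and `service_state kubelet`.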
Restart the Watson Studio Local cluster
If you need to reboot the entire Watson Studio Local cluster, the nodes should start back up in the following order.
For a seven-node cluster:
- First control/storage node
- Second control/storage node
- Third control/storage node
- Compute nodes in any order
- Deployment nodes in any order
To reboot a four-node cluster:
- Run the df -h /var command on all nodes and make sure that none of them is 100% full.
- After rebooting a node and before rebooting the next one, enter the following command to ensure that all pods are running: kubectl get po --all-namespaces | grep -v Running. Repeat the command until all pods are running.
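Repeating the pod check by hand between reboots can be wrapped in a small polling loop. The following is a minimal sketch: the function names and the 15-second interval are hypothetical, and it assumes that kubectl is already configured on the node where it runs. Like the grep in the step above, it treats every status other than Running as not ready, so a pod that legitimately ends in another state (for example, Completed) keeps the loop waiting.

```shell
#!/usr/bin/env bash
# Sketch only: wait until every pod reports Running after a node reboot.
# Function names and the default poll interval are hypothetical.

# Reduce `kubectl get po --all-namespaces` output (on stdin) to the pods
# whose STATUS column (4th field) is not "Running". The header is skipped.
pods_not_running() {
  awk 'NR > 1 && $4 != "Running"'
}

# Poll until the filter above prints nothing. A failed kubectl call (for
# example, the API server is still starting) counts as "not ready yet".
wait_for_pods() {
  local interval="${1:-15}" out pending
  while :; do
    if out="$(kubectl get po --all-namespaces 2>/dev/null)"; then
      pending="$(printf '%s\n' "$out" | pods_not_running)"
      [ -z "$pending" ] && return 0
      printf 'Still waiting on:\n%s\n' "$pending"
    else
      echo "kubectl did not answer yet"
    fi
    sleep "$interval"
  done
}
```

Calling `wait_for_pods` after each node comes back up, and rebooting the next node only when it returns, follows the same order as the manual procedure.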
Capacity:alpha.kubernetes.io/nvidia-gpu: 0), then rerun cudaInit_ppc64le to recreate /dev/nvidia-uvm. Then restart kubelet.
Disable IPv6
If you disable Internet Protocol version 6 (IPv6) on RHEL 7.3 or 7.4, complete the following steps to avoid the symptoms described in the Kubernetes Troubleshooting FAQ (for example, on RHEL 7.3 the RPCBIND service fails to start, which prevents glusterd from starting):
- In the /etc/sysctl.conf file, add the following lines:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
- Enter the following command:
- Restart Watson Studio Local.
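Before restarting, it can be useful to confirm that both settings are actually present in the file. The following is a minimal sketch of such a check; the function name is hypothetical and it only inspects the file contents, not the live kernel settings.

```shell
#!/usr/bin/env bash
# Sketch only: confirm that a sysctl-style file sets both IPv6 keys to 1.
# The function name is hypothetical.

ipv6_disabled_in() {
  # Split each line on "=", strip whitespace, and remember the last value
  # seen for each key. Commented-out lines are ignored.
  awk -F= '
    /^[[:space:]]*#/ { next }
    {
      key = $1; val = $2
      gsub(/[[:space:]]/, "", key)
      gsub(/[[:space:]]/, "", val)
      if (key == "net.ipv6.conf.all.disable_ipv6") all = val
      if (key == "net.ipv6.conf.default.disable_ipv6") def = val
    }
    END { exit !(all == "1" && def == "1") }
  ' "$1"
}
```

For example, `ipv6_disabled_in /etc/sysctl.conf && echo "both keys set"`. The values that the kernel is currently using can be read separately with `sysctl -n net.ipv6.conf.all.disable_ipv6`.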