Troubleshoot Data Science Experience Local
DSX Local provides various methods, scripts, and support for troubleshooting your cluster.
- Troubleshoot the installation
- Troubleshoot common problems
- Troubleshoot Data Science Experience Local on ICP
- Troubleshoot your system performance
- Troubleshoot your system with administration utilities
- Restart the DSX Local cluster
- Disable IPv6
Troubleshoot your system performance
If computing jobs slow down, or stall to the point where no new projects or assets can be created, reserve more CPU and memory for your runtime on the Environments page. You can also stop the runtime environment for idle notebooks to free up resources.
Troubleshoot your system with administration utilities
To automatically troubleshoot your entire DSX Local cluster, run the /wdp/utils/admin-utils.sh script by entering the following command:
admin-utils.sh --user username --password --port portnumber --key-pair ssh_key_pair_file
In this command:
- username is your username for the cluster.
- --password prompts for the password.
- portnumber is the SSH port for the cluster (the default is 22 for all nodes).
- ssh_key_pair_file is an optional private SSH key file.
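As an illustration of how the options fit together, the following sketch assembles the command line from the documented options and prints it so that you can review it before running it. The build_admin_utils_cmd function name is hypothetical and not part of DSX Local. The sketch also assumes that the --key-pair option can be left out when no key file is used, which the text above implies but does not state outright.

```shell
# Hypothetical helper (not shipped with DSX Local): assemble the
# admin-utils.sh command line from the documented options and print it.
# Arguments: username, SSH port (defaults to 22), optional private key file.
build_admin_utils_cmd() {
  user="$1"
  port="${2:-22}"
  key_file="${3:-}"
  cmd="/wdp/utils/admin-utils.sh --user $user --password --port $port"
  # Assumption: --key-pair is only needed when a key file is supplied.
  if [ -n "$key_file" ]; then
    cmd="$cmd --key-pair $key_file"
  fi
  printf '%s\n' "$cmd"
}
```

For example, build_admin_utils_cmd admin 22 /root/.ssh/id_rsa prints a command line for the user admin on port 22 with that key file. The user name and key path here are placeholders.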
The command performs the following tasks:
- Checks whether all nodes are up and accessible with SSH
- Checks free disk space on all nodes
- Checks whether Docker and kubelet are running
- Checks whether Gluster is installed and the volumes are up
If all of the checks pass, the script automatically logs everything that is not in the Running state into the following compressed file:
/wdp/utils/admin-utils-timestamp.tar.gz
Send this file to IBM support.
Restart the DSX Local cluster
Before you restart a node, enter the following command on the node to cleanly shut down DSX Local:
systemctl stop kubelet && systemctl stop docker && systemctl stop glusterd
If you need to reboot the entire DSX Local cluster, the nodes should start back up in the following order.
For an eight-node cluster:
- First control/storage node
- Second control/storage node
- Third control/storage node
- Compute nodes in any order
- Deployment nodes in any order
To reboot a five-node cluster:
- Enter the df -h /var command on all nodes and make sure that /var is not 100% full on any of them.
- After rebooting a node and before rebooting the next one, enter the following command to ensure that all pods are running:
kubectl get po --all-namespaces | grep -v Running
Repeat the command until all pods are running.
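The two checks in the reboot steps can be scripted rather than repeated by hand. The following is a minimal sketch, not a DSX Local utility: the function names, the retry count, and the delay are all assumptions. It reads the STATUS column of the kubectl output instead of using grep -v, so the header line is ignored. Pods that legitimately end in another state, such as Completed, would still be reported as not running.

```shell
# Hypothetical helpers (not shipped with DSX Local).

# Read `df -P` output on stdin and print the mount points whose usage is
# at or above the given percentage (defaults to 100).
full_filesystems() {
  threshold="${1:-100}"
  awk -v t="$threshold" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6 }'
}

# Read `kubectl get po --all-namespaces` output on stdin and print the pod
# lines whose STATUS column (the fourth column) is not Running.
not_running_pods() {
  awk 'NR > 1 && $4 != "Running"'
}

# Poll until every pod is Running. Arguments: maximum attempts (assumed
# default 30) and seconds between attempts (assumed default 10).
# Returns 1 and prints the remaining pods if the attempts run out.
wait_for_pods() {
  attempts="${1:-30}"
  delay="${2:-10}"
  pending=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Only trust the result when kubectl itself succeeded.
    if out="$(kubectl get po --all-namespaces 2>/dev/null)"; then
      pending="$(printf '%s\n' "$out" | not_running_pods)"
      [ -z "$pending" ] && return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  printf '%s\n' "$pending" >&2
  return 1
}
```

On a node, df -P /var | full_filesystems prints /var only when it is completely full, and wait_for_pods can be run after each reboot before moving on to the next node.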
Disable IPv6
If you disable Internet Protocol version 6 (IPv6) on RHEL 7.3 or 7.4, complete the following steps to avoid the symptoms described in the Kubernetes Troubleshooting FAQ, such as the RPCBIND service failing to start, which prevents glusterd from starting on RHEL 7.3:
In the /etc/sysctl.conf file, add the following lines:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
Enter the following command:
Restart DSX Local.
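Before restarting, it can help to confirm that both settings are present in the configuration file. The following sketch is one way to do that. The ipv6_disabled_in_conf function name is hypothetical, and the check only inspects the file contents, not the values that the running kernel is using.

```shell
# Hypothetical helper (not shipped with DSX Local): return 0 when the given
# sysctl configuration file sets both IPv6 disable keys to 1.
# Argument: path to the file (defaults to /etc/sysctl.conf).
ipv6_disabled_in_conf() {
  conf="${1:-/etc/sysctl.conf}"
  for key in net.ipv6.conf.all.disable_ipv6 net.ipv6.conf.default.disable_ipv6; do
    # Strip all whitespace so "key = 1" and "key=1" compare equal, then
    # require an exact whole-line match; commented-out lines do not match.
    sed 's/[[:space:]]//g' "$conf" | grep -Fxq "$key=1" || return 1
  done
  return 0
}
```

Running ipv6_disabled_in_conf with no argument checks /etc/sysctl.conf and returns a nonzero status if either line is missing.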