Troubleshoot Data Science Experience Local

DSX Local provides various methods, scripts, and support for troubleshooting your cluster.

Troubleshoot your system performance

If computing jobs slow down, or the cluster stalls to the point where no new projects or assets can be created, reserve more CPU and memory for your runtime on the Environments page. You can also stop the runtime environments of idle notebooks to free up resources.

Administration utilities

To automatically troubleshoot your entire DSX Local cluster, run the /wdp/utils/admin-utils.sh script by entering the following command:

admin-utils.sh --user username --password --port portnumber --key-pair ssh_key_pair_file

where username represents your username for the cluster, --password prompts for the password, portnumber represents the SSH port for the cluster (default is 22 for all nodes), and ssh_key_pair_file represents an optional private SSH key file.
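As an illustration of how the placeholders map to concrete values, the following minimal sketch assembles the command line and prints it instead of running it. The user name dsxadmin and the key path /root/.ssh/id_rsa are hypothetical, and leaving out --key-pair when no key file is used is an assumption based on the option being described as optional.

```shell
# Build the admin-utils.sh command line from concrete values.
# Arguments: username, SSH port (defaults to 22), optional private key file.
build_admin_utils_cmd() {
  user=$1
  port=${2:-22}
  key=${3:-}
  cmd="/wdp/utils/admin-utils.sh --user $user --password --port $port"
  # Assumption: --key-pair is simply left out when no key file is used.
  if [ -n "$key" ]; then
    cmd="$cmd --key-pair $key"
  fi
  printf '%s\n' "$cmd"
}

# Hypothetical values; substitute the ones for your cluster.
build_admin_utils_cmd dsxadmin 22 /root/.ssh/id_rsa
```

Review the printed line, then run it on the cluster; the script prompts for the password because --password takes no value.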

The command performs the following tasks:

  • Checks whether all nodes are up and accessible with SSH
  • Checks free disk space on all nodes
  • Checks whether Docker and kubelet are running
  • Checks whether Gluster is installed and the volumes are up

If all of the checks pass, the script automatically logs everything that is not in the Ready or Running state to the following compressed file: /wdp/utils/admin-utils-timestamp.tar.gz. Send this file to IBM support.

Restart the DSX Local cluster

Before you restart a node, enter the following command on the node to cleanly shut down DSX Local:

systemctl stop kubelet && systemctl stop docker && systemctl stop glusterd
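The same shutdown sequence can be wrapped in a small function that reports which service failed to stop. This is a minimal sketch, not part of DSX Local: the function name stop_dsx_services and the RUN variable are hypothetical, and RUN=echo is used here only to preview the commands without executing them.

```shell
# Stop the services in the order shown above and stop at the first failure.
# RUN is a hypothetical hook: with RUN=echo the commands are printed, not run.
stop_dsx_services() {
  for svc in kubelet docker glusterd; do
    if ! ${RUN:-} systemctl stop "$svc"; then
      echo "failed to stop $svc" >&2
      return 1
    fi
  done
}

# Dry run: prints the three systemctl commands in order.
RUN=echo stop_dsx_services
```

On a node, calling stop_dsx_services as root without RUN set runs the actual systemctl commands.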

If you need to reboot the entire DSX Local cluster, the nodes should start back up in the following order.

For an eight-node cluster:

  1. First control/storage node
  2. Second control/storage node
  3. Third control/storage node
  4. Compute nodes in any order
  5. Deployment nodes in any order

To reboot a five-node cluster:

  1. Run the df -h /var command on all nodes and make sure that /var is not 100% full.
  2. After rebooting a node and before rebooting the next one, enter the following command to ensure all pods are running: kubectl get po --all-namespaces | grep -v Running. Repeat the command until all pods are running.
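The two steps above can be sketched as helper functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, df -P is used instead of df -h so that the output is easy to parse, and the STATUS value is assumed to be the fourth column of the kubectl get po --all-namespaces output. The --no-headers flag drops the header line that grep -v Running would otherwise always print.

```shell
# Print the usage percentage from `df -P /var` output read on stdin.
var_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Count pods whose STATUS column is not Running.
# Reads `kubectl get po --all-namespaces --no-headers` output on stdin.
count_not_running() {
  awk 'NF && $4 != "Running" { n++ } END { print n + 0 }'
}

# Poll until kubectl answers and every pod is Running.
# Argument: seconds to sleep between polls (defaults to 10).
wait_for_pods() {
  interval=${1:-10}
  while :; do
    if pods=$(kubectl get po --all-namespaces --no-headers 2>/dev/null) &&
       [ "$(printf '%s\n' "$pods" | count_not_running)" -eq 0 ]; then
      break
    fi
    sleep "$interval"
  done
}

# Example on a node (commented out so the sketch has no side effects):
#   [ "$(df -P /var | var_use_pct)" -lt 100 ] || echo "/var is full" >&2
#   wait_for_pods 15
```

Like the grep -v Running filter, this treats any status other than Running as not ready, so pods of finished jobs that show Completed keep the loop waiting; adjust the filter if your cluster has such pods.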

Disable IPv6

If you disable Internet Protocol version 6 (IPv6) on RHEL 7.3 or 7.4, complete the following steps to avoid the symptoms described in the Kubernetes Troubleshooting FAQ. For example, on RHEL 7.3 the RPCBIND service can fail to start, which in turn prevents glusterd from starting:

  1. In the /etc/sysctl.conf file, add the following lines:

    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    
  2. Enter the following command: dracut -f

  3. Restart DSX Local.
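Steps 1 and 2 can be scripted so that running them twice does not duplicate the settings. This is a minimal sketch: the function names ensure_line and disable_ipv6_settings are hypothetical, and the demonstration writes to a temporary file rather than to /etc/sysctl.conf.

```shell
# Append a line to a file only if that exact line is not already present.
ensure_line() {
  file=$1
  line=$2
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Add both IPv6 settings to the given file (defaults to /etc/sysctl.conf).
disable_ipv6_settings() {
  conf=${1:-/etc/sysctl.conf}
  ensure_line "$conf" 'net.ipv6.conf.all.disable_ipv6 = 1'
  ensure_line "$conf" 'net.ipv6.conf.default.disable_ipv6 = 1'
}

# Demonstration against a temporary file; the second call adds nothing.
tmp=$(mktemp)
disable_ipv6_settings "$tmp"
disable_ipv6_settings "$tmp"
cat "$tmp"
rm -f "$tmp"

# On a node, as root (commented out so the sketch has no side effects):
#   disable_ipv6_settings && dracut -f
```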