
Troubleshoot the Watson Studio Local installation

You can troubleshoot a failed Watson Studio Local installation in a variety of ways.

If an installation page displays improperly due to a network issue, refresh the page.

Retry or skip a failing installation step

If an installation step fails, you can view the log and contact IBM support. To avoid unexpected issues, resolve the problem that the log reports before you retry or skip the step.

Figure: Log file for a failed installation step with the retry or skip option

Tip: If you are installing with wdp.conf and a step fails because of a timeout, you can extend the step timeout by setting the step_timeout parameter in seconds, for example, step_timeout=14400 (a 4-hour timeout). Alternatively, before the installation begins, you can extend the timeout for all of the Helm installation steps by setting the helm_request_timeout parameter in seconds, for example, helm_request_timeout=14400.
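Both parameters take seconds, so the only arithmetic involved is multiplying hours by 3600. The following minimal sketch generates the two lines for a wdp.conf file. It assumes that wdp.conf holds one key=value pair per line, as the examples in the tip suggest. The helper name timeout_lines is hypothetical and is not part of the installer.

```python
# Minimal sketch: build wdp.conf timeout lines from a number of hours.
# Assumption: wdp.conf holds one key=value pair per line (as in step_timeout=14400).
# timeout_lines is a hypothetical helper, not part of the Watson Studio Local installer.

SECONDS_PER_HOUR = 3600


def timeout_lines(step_hours=None, helm_hours=None):
    """Return wdp.conf lines for the requested timeouts, converted to seconds."""
    lines = []
    if step_hours is not None:
        lines.append(f"step_timeout={int(step_hours * SECONDS_PER_HOUR)}")
    if helm_hours is not None:
        lines.append(f"helm_request_timeout={int(helm_hours * SECONDS_PER_HOUR)}")
    return lines


if __name__ == "__main__":
    # 4 hours -> 14400 seconds, the example value used in the tip.
    for line in timeout_lines(step_hours=4, helm_hours=4):
        print(line)
```

When you run it, it prints step_timeout=14400 and helm_request_timeout=14400. These are the same 4-hour values that the tip shows.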

View progress for a dead session

If you accidentally disconnect from the installation session, you can still view the progress in the log file InstallPackage/tmp/SetupCluster.out. This file records the steps that run in the Docker container named wdp-ansible. When the installation finishes successfully, the last line of the file displays WDP_FINISH|0| followed by the web address that you use to sign in to your Watson Studio Local client.

If the log stops on an installation step failure, for example, Step fail - Wait for user's response and Wait for action file to show up, you can enter either InstallPackage/ retry to retry the step or InstallPackage/ skip to skip the step.
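If you reconnect often, a small script can read the same log and report which state the installation is in. This is a sketch under stated assumptions. It treats a last line that starts with WDP_FINISH|0| as success and takes the rest of that line as the sign-in address. It treats a "Step fail" line among the last few lines as a step that is waiting for you to retry or skip it. The function name install_status is hypothetical.

```python
# Minimal sketch: report installation state from InstallPackage/tmp/SetupCluster.out.
# Assumptions: on success the last line starts with "WDP_FINISH|0|" followed by the
# sign-in address; a waiting failed step shows up as a "Step fail" line near the end.
# install_status is a hypothetical helper name.
import sys

FINISH_PREFIX = "WDP_FINISH|0|"
TAIL_LINES = 5  # how many trailing lines to scan for a failed step (assumption)


def install_status(log_text):
    """Return ("finished", address), ("waiting", step_line), or ("running", None)."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    if not lines:
        return ("running", None)
    last = lines[-1]
    if last.startswith(FINISH_PREFIX):
        return ("finished", last[len(FINISH_PREFIX):])
    # Only the end of the log matters: an earlier failure may have been retried.
    for line in reversed(lines[-TAIL_LINES:]):
        if "Step fail" in line:
            return ("waiting", line)
    return ("running", None)


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "InstallPackage/tmp/SetupCluster.out"
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            print(install_status(f.read()))
    except FileNotFoundError:
        print(f"Log not found: {path}", file=sys.stderr)
```

A "waiting" result means that you should check the log, resolve the problem, and then retry or skip the step as described above.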

Resume installation from a specific step

If the log file stops updating for more than 2 hours, the installation or its container has probably failed. For example, the node that runs the installation package might have been rebooted during the installation. Check the InstallPackage/tmp/SetupCluster.out log file to identify the failing step, and then either clean up or manually finish that step. Next, create a wdp.conf file in which all parameters match your previous installation. Add a new line jump_install=2, where 2 is the number of the step to continue the installation from. Then rerun the installation from the command line so that the installer detects the wdp.conf file.
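This procedure involves two manual decisions: whether the log has stalled, and how to add jump_install to a copy of the previous configuration. Both can be sketched as follows. The sketch assumes that "stalled" means the file's modification time is more than 2 hours old and that wdp.conf uses key=value lines. Both helper names are hypothetical, and the sketch does not identify or clean up the failing step for you.

```python
# Minimal sketch: detect a stalled SetupCluster.out and prepare a resume wdp.conf.
# Assumptions: "stalled" = not modified for more than 2 hours; wdp.conf uses
# key=value lines. log_is_stalled and with_jump_install are hypothetical helpers.
import os
import time

STALL_SECONDS = 2 * 60 * 60  # the 2-hour threshold from the text


def log_is_stalled(path, now=None, threshold=STALL_SECONDS):
    """True if the file at path has not been modified for longer than threshold."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > threshold


def with_jump_install(conf_text, step):
    """Copy conf_text, drop any existing jump_install line, and append jump_install=step."""
    kept = [
        line
        for line in conf_text.splitlines()
        if line.split("=", 1)[0].strip() != "jump_install"
    ]
    kept.append(f"jump_install={step}")
    return "\n".join(kept) + "\n"
```

For example, with_jump_install(old_conf, 2) returns the previous settings unchanged, with a single jump_install=2 line at the end. Keeping every other line intact reflects the requirement that all parameters match the previous installation.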

Monitor system performance during the installation process

During the installation process, Watson Studio Local automatically runs a monitoring tool in the background to collect system performance statistics on every node in the cluster. The monitoring tool stops automatically when the installation either succeeds or fails. All data is written to a monitor_results.csv file in the installation directory on each node. The tool also generates graphs in each installation directory, for example, cpu.png and memory_buff.png, so that you can spot performance issues visually.

To check that the monitoring tool is running, enter the following command:
ps aux | grep monitor_install | grep -v grep
Verify that the process is running with a PID (in this example, 3393):
root  3393  0.0 0.0 18052 1560 ? S 11:20 0:00 bash -c monitor_install '/wdp' '/wdp/AnsiblePlaybooks/hosts' '--become --become-method=sudo' '7200'
To manually stop writing monitoring data to files on all the nodes, terminate the process using its PID:
kill 3393
To manually rerun the monitoring tool after the installation has completed, enter the following command in the installation directory:
./dpctl monitor --config=wdp.conf
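You can also reproduce the ps and grep pipeline above in a script, for example, to look up the PID before you run kill. This sketch assumes the standard Linux ps aux column layout (USER, PID, and so on through COMMAND). The function name monitor_pids is hypothetical.

```python
# Minimal sketch: find monitor_install PIDs, like
#   ps aux | grep monitor_install | grep -v grep
# Assumes the standard Linux `ps aux` columns:
#   USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
# monitor_pids is a hypothetical helper name.
import subprocess


def monitor_pids(ps_output):
    """Return PIDs whose command line mentions monitor_install, ignoring grep itself."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # maxsplit keeps spaces inside COMMAND
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        command = fields[10]
        if "monitor_install" in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


if __name__ == "__main__":
    try:
        out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        out = ""  # ps is not installed on this system
    print(monitor_pids(out))
```

For the sample output shown above, the function returns [3393]. That PID is the one to pass to kill.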

The monitor_results.csv file contains the following output:

  • Output from the vmstat command, used to monitor and collect virtual memory statistics, including memory, swap, system, and CPU information.
  • Output from the iostat command, used to collect input and output statistics for both the installation and the data paths.
  • Output from the sar command, used to collect network interface information.
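The column layout of monitor_results.csv is not described here, so the following sketch works on raw vmstat output instead and reads the column names from vmstat's own header row. It reports the peak CPU busy percentage (100 minus the id column), the peak I/O wait (wa), and the peak swap-out rate (so). Treat it as an illustration of reading vmstat samples. It is not a parser for the monitoring tool's CSV, and the function name summarize_vmstat is hypothetical.

```python
# Minimal sketch: summarize raw `vmstat` output. This does NOT parse the
# monitor_results.csv layout, which is not documented here.
# summarize_vmstat is a hypothetical helper name.


def summarize_vmstat(text):
    """Return peak CPU busy %, peak I/O wait %, and peak swap-out from vmstat samples."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "us" in fields and "id" in fields:
            header = fields  # the row that names each column (r b swpd ... id wa st)
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    if not rows:
        raise ValueError("no vmstat samples found")
    return {
        "max_cpu_busy": max(100 - r["id"] for r in rows),  # busy = 100 - idle
        "max_iowait": max(r["wa"] for r in rows),
        "max_swap_out": max(r["so"] for r in rows),
    }
```

Reading column names from the header, rather than relying on fixed positions, keeps the sketch working on vmstat versions that omit the st column.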

Common problems

Validate the following requirements to prevent an installation failure:

  • Do not change the host name or IP address of a Watson Studio Local node after installation; this is not supported. If either one changes, the cluster has issues and some pods cannot run.
  • Ensure that all servers can ping each other by hostname.
  • Ensure that a default gateway is set on each server (check by using the route command).
  • If you used the SSH key installation, ensure that you can SSH without a password from the node you are installing from to the other nodes in the cluster.
  • Ensure that the required partitions are present and have sufficient free space.
  • Ensure that the proxy IP is unused, cannot be pinged, and is not associated with any other system.
  • Ensure that SELinux is set to either permissive or enforcing on each node. You can check the status on each node by using the sestatus command.
  • Ensure that time synchronization is set up and working on each node, for example, by using ntpq -p for NTP or chronyc tracking for chrony. If the nodes are not synchronized, the pre-validation script displays the following message:
    ===== NTP configuration summary ===== is synced to NTP server, NTP is not synced NTP is not synced
  • Ensure that firewalld and iptables are neither running nor enabled (check by using systemctl status firewalld and systemctl status iptables).
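You can script several of these checks on each node. The following minimal sketch covers the default gateway, the SELinux mode (from getenforce), chrony synchronization (from the Leap status line of chronyc tracking), and host name resolution. It reads the default gateway from /proc/net/route, where a default route has destination 00000000. The sketch is Linux-only, the helper names are hypothetical, and it does not replace the installer's own pre-validation script.

```python
# Minimal sketch of a few per-node pre-checks (Linux only).
# Parsing helpers are pure functions; the __main__ block collects real data
# when the tools exist. All helper names are hypothetical.
import shutil
import socket
import subprocess


def has_default_gateway(proc_net_route):
    """True if /proc/net/route lists destination 00000000 with a non-zero gateway."""
    for line in proc_net_route.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) > 2 and fields[1] == "00000000" and fields[2] != "00000000":
            return True
    return False


def selinux_ok(getenforce_output):
    """The requirement: SELinux must be permissive or enforcing (not disabled)."""
    return getenforce_output.strip().lower() in {"permissive", "enforcing"}


def chrony_synced(tracking_output):
    """True if `chronyc tracking` reports 'Leap status : Normal'."""
    for line in tracking_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Leap status":
            return value.strip() == "Normal"
    return False


def resolves(hostname):
    """Name resolution is a prerequisite for pinging other nodes by host name."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def run(cmd):
    """Run cmd if its program is installed; return stdout, or None if it is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    try:
        with open("/proc/net/route") as f:
            print("default gateway:", has_default_gateway(f.read()))
    except FileNotFoundError:
        print("default gateway: /proc/net/route not available")
    mode = run(["getenforce"])
    print("SELinux permissive/enforcing:", None if mode is None else selinux_ok(mode))
    tracking = run(["chronyc", "tracking"])
    print("chrony synced:", None if tracking is None else chrony_synced(tracking))
    print("own host name resolves:", resolves(socket.gethostname()))
```

A value of None means that the corresponding tool is not installed on that node. In that case, check the requirement manually.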