Monitor cluster nodes

Your IBM Data Science Experience Local (DSX Local) deployment consists of three types of nodes:

  • Control nodes are the nodes that manage your Kubernetes cluster and your DSX Local deployment.

    By default, the cluster has three control nodes. If you notice that a node is down, attempt to restore it to prevent outages. The cluster can continue to run if one node fails. However, if two nodes fail, your cluster fails.

    In the admin dashboard, control node names include the term manage.

  • Storage nodes are the nodes where DSX Local metadata and any data that you load into DSX Local is stored.

    By default, the cluster has three storage nodes. The data is replicated to every storage node, so if a node fails, you can still access the data. DSX Local can continue to run if two storage nodes fail.

    Tip: If you run out of space on your storage nodes, add XFS-formatted disks to each node and extend the Logical Volume Management (LVM) partition to include the disks. If possible, ensure that the disks are the same size.

    In the admin dashboard, storage node names include the term storage.

  • Compute nodes are the nodes where DSX services, such as Spark, run.

    Unlike storage nodes, compute nodes are not replicated. When a new process starts, Kubernetes determines which node has sufficient capacity to run the process. DSX Local can continue to run when multiple compute nodes fail. However, you might notice that performance decreases when multiple nodes are down.

    Additionally, if a node fails, Kubernetes attempts to bring any active processes up on another node. While Kubernetes attempts to bring up the processes, you might experience an outage. If Kubernetes cannot bring the processes up on another node and the outage continues, contact IBM Software Support.

    In the admin dashboard, compute node names include the term compute.
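
The failure limits described above can be sketched as a small check. This is a minimal illustration, not part of DSX Local: it assumes the default of three control nodes and three storage nodes, and the node names in the usage note are hypothetical. Only the terms manage, storage, and compute come from the admin dashboard naming described above.

```python
# Tolerated failures per node group, assuming the default of three nodes:
# the cluster survives one failed control node and two failed storage nodes.
# No limit is stated for compute nodes, so they never trigger a risk here.
MAX_FAILURES = {"control": 1, "storage": 2}

# Terms that node names include in the admin dashboard.
NAME_TERMS = {"manage": "control", "storage": "storage", "compute": "compute"}


def node_group(name):
    """Return the node group for a dashboard node name, or None if unknown."""
    for term, group in NAME_TERMS.items():
        if term in name:
            return group
    return None


def groups_at_risk(down_nodes):
    """Return the groups whose failed-node count exceeds the tolerated limit."""
    failed = {}
    for name in down_nodes:
        group = node_group(name)
        if group is not None:
            failed[group] = failed.get(group, 0) + 1
    return sorted(
        group
        for group, count in failed.items()
        if count > MAX_FAILURES.get(group, float("inf"))
    )
```

For example, `groups_at_risk(["dsx-manage-1", "dsx-manage-2"])` reports the control group, because two failed control nodes exceed the limit of one.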

Monitor node health

For a high-level overview of the status of your cluster, monitor the health of your cluster nodes from the Dashboard page. You can access the Dashboard page from the menu icon.

Specifically, you can monitor:

  • CPU usage
  • Memory usage
  • Disk usage

For compute nodes, the usage of CPU and memory is measured against the CPU and memory that the DSX users reserved.

Each card on the Dashboard page shows the average usage across all of the nodes:

Sample card that shows the average CPU usage across the compute nodes in the cluster

However, you can expand the cards to see the specific usage for each node:

Sample expanded card that shows the CPU usage for each compute node in the cluster

This data is refreshed every 10 seconds.
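
The relationship between the collapsed card (one average) and the expanded card (one value per node) can be sketched as follows. This is an illustration only. It assumes that the average is the mean of the per-node percentages, and that for compute nodes the capacity is the amount that users reserved. The dashboard's actual calculation is not documented here, and the function names are hypothetical.

```python
def usage_percent(used, capacity):
    """Return usage as a percentage of capacity (0.0 if capacity is not positive)."""
    if capacity <= 0:
        return 0.0
    return 100.0 * used / capacity


def card_summary(nodes):
    """Summarize one dashboard-style card.

    `nodes` maps a node name to a (used, capacity) pair in the same unit,
    for example CPU cores or gigabytes. For compute nodes, `capacity` is
    assumed to be the reserved amount.
    Returns (average_percent, per_node_percent).
    """
    per_node = {
        name: usage_percent(used, capacity)
        for name, (used, capacity) in nodes.items()
    }
    # Assumption: the card shows the mean of the per-node percentages.
    average = sum(per_node.values()) / len(per_node) if per_node else 0.0
    return average, per_node
```

Calling `card_summary` with two nodes at 2 of 4 cores and 1 of 4 cores gives an average of 37.5 and per-node values of 50.0 and 25.0.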

By default, Kubernetes attempts to balance the load across servers.

Nodes are overloaded when they run above 90% usage. Contact IBM Software Support if you notice that all of the nodes in a group are overloaded for an extended time.

Nodes can become overloaded when:

  • You have more users than your cluster configuration can handle. For example, your cluster doesn't have sufficient CPU, memory, or storage.
  • A node fails and other nodes need to handle requests that would normally be handled by that node.
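
The overload rule above (every node in a group above 90% usage for an extended time) can be sketched as a check over recent samples. The text does not define "extended time", so the `min_samples` parameter is a hypothetical stand-in: with one sample every 10 seconds, 60 samples would cover roughly 10 minutes.

```python
OVERLOAD_THRESHOLD = 90.0  # percent usage above which a node counts as overloaded


def group_overloaded(history, min_samples):
    """Return True if every node stayed above the threshold for the
    last `min_samples` samples.

    `history` maps a node name to a list of usage percentages, oldest first.
    `min_samples` is an assumed definition of "extended time".
    """
    if not history or min_samples <= 0:
        return False
    for samples in history.values():
        recent = samples[-min_samples:]
        # Not enough data for this node to call it a sustained overload.
        if len(recent) < min_samples:
            return False
        if any(value <= OVERLOAD_THRESHOLD for value in recent):
            return False
    return True
```

A single node that dips to 90% or below within the window is enough to return False, which matches the guidance to contact support only when all nodes in the group are overloaded.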

Monitor network usage

If you encounter an issue with DSX Local, you can view the recent network traffic in your cluster on the Dashboard page. You can view the number of megabytes that were sent and received across the nodes of the cluster over the last 20 minutes.

Example of the Network usage card
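
The 20-minute view of network traffic can be approximated with a simple time-window sum. This sketch assumes that traffic is available as timestamped samples of megabytes sent and received. The sample format and the function name are hypothetical and do not describe how the dashboard collects its data.

```python
WINDOW_SECONDS = 20 * 60  # the Dashboard page covers the last 20 minutes


def network_totals(samples, now):
    """Sum the megabytes sent and received in the 20 minutes ending at `now`.

    `samples` is an iterable of (timestamp_seconds, mb_sent, mb_received).
    Returns (total_mb_sent, total_mb_received).
    """
    start = now - WINDOW_SECONDS
    sent = 0.0
    received = 0.0
    for timestamp, mb_sent, mb_received in samples:
        # Keep only samples that fall inside the window.
        if start <= timestamp <= now:
            sent += mb_sent
            received += mb_received
    return sent, received
```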