
Spark Canvas flows add-on (Beta)

You can build a machine learning model as a flow by using the Spark Canvas to conveniently transform data, train the model, and evaluate it. For a compatibility report of data sources supported by Spark Canvas in Watson Studio Local, see Software Product Compatibility.

Figure 1. Video that shows how Spark Canvas integrates with Watson Studio Local.

Requirements

A Spark Canvas runtime is created per user, per project. Each runtime consumes 1 CPU and 5 GB of memory.

Install the Spark Canvas flows add-on

To install the Spark Canvas flows add-on into Watson Studio Local, complete the following steps:

  1. Sign in to the first master node of the Watson Studio Local cluster as root or as a user who is a member of the docker group on all nodes of the cluster.
  2. On the first master node, download the Spark Canvas installation package spark-canvas-1.2.1.tgz.
  3. Extract the contents of the Spark Canvas installation package: tar xvf spark-canvas-1.2.1.tgz.
  4. In the Spark Canvas directory, run install.sh.

Build a model with the Spark Canvas

To build a machine learning flow by using the Spark Canvas, complete the following steps:

  1. Open a project in the Watson Studio Local client.
  2. From your project, click the Assets tab and click Modeler Flows.
  3. Click Add Modeler Flow.
  4. Click the Blank tab and select Scala Spark 2. For the Spark runtime, select either the Watson Studio Local Spark or any registered Hadoop system that has Spark 2. Alternatively, you can click the From File tab to import a preexisting .STR file.
  5. Name and describe your machine learning flow.
  6. Click the Create button. The Spark Canvas tool opens so that you can build your flow.
  7. Add the data from your project to the Spark Canvas. Click the Find and Add Data icon to see a list of the data sets or connections to choose from.
  8. Open the node palette by clicking the palette icon.
  9. From the node palette, select a node and drag it to the Spark Canvas. See The node palette for descriptions.
  10. From the Spark Canvas, double-click a node to specify its properties.
  11. Draw a connector from the data set to the node.

    Figure: two nodes joined by a connector.

  12. Continue to add operators or other nodes as needed to build your model.

    Figure: a flow with several nodes.

Options for building a model

  • You can run any terminal node within the Spark Canvas without running the entire model. Right-click the node and select Run.
  • To view the results of an Outputs node, such as a table node, run the node and then click the View outputs and versions icon. In the side palette, on the Outputs tab, click the object, such as a table, to open it.

The node palette

Transformations nodes
These transformation nodes enable you to work with data directly so that you can create data sets that are easier to manipulate. You can filter out unnecessary data rows or extend a data set by adding a column of data.
Filter Rows
Filter rows based on criteria. Set a condition by using a Spark SQL expression to create a new data set.
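
For orientation, a minimal Spark sketch of the same operation, assuming an incoming DataFrame named df with illustrative column names (this is not the code that the node generates):

    // Keep only the rows that satisfy a Spark SQL condition.
    val filtered = df.filter("age > 30 AND country = 'US'")
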
Select Columns
Select specific columns of data for work later in the machine learning flow. Other columns are excluded from this working data set.
Add Column
Use the Spark SQL expression language to combine existing columns to create new columns or to create a new column of constant values.
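
A rough sketch of the equivalent Spark code, again assuming a DataFrame df; the price and quantity columns are hypothetical:

    import org.apache.spark.sql.functions.{expr, lit}

    // Combine existing columns with a Spark SQL expression.
    val withTotal = df.withColumn("total", expr("price * quantity"))
    // Or add a column of constant values.
    val withSource = withTotal.withColumn("source", lit("canvas"))
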
Rename Column
Rename a column to better reflect the content or to match the same data in another table.
Sample Rows
Create a random sample of rows, with or without replacement, with the option to provide a limit on the sample size.
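
In Spark terms, the node behaves roughly like this sketch; the fraction and limit are illustrative:

    // Sample about 10% of the rows without replacement,
    // then cap the sample at 1,000 rows.
    val sampled = df.sample(withReplacement = false, fraction = 0.1).limit(1000)
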
SQL Transform
Transform data by running an arbitrary SQL statement. For example, if two data sets have no key columns in common, you can fabricate a dummy key column so that a join can be constructed. The SQL Transform node can be connected to any node; it treats the incoming DataFrame as a SQL table that can be queried and augmented with new fields based on the SQL that you run. Reference the incoming DataFrame in queries by using the table name THIS.
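
A minimal sketch of the idea in Spark code, assuming a SparkSession named spark and an incoming DataFrame df; the view name THIS mirrors the node's convention, and the query itself is illustrative:

    // Expose the incoming DataFrame as a SQL table named THIS,
    // then fabricate a dummy key column for a later join.
    df.createOrReplaceTempView("THIS")
    val transformed = spark.sql("SELECT *, 1 AS dummy_key FROM THIS")
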
Select Distinct Rows
Specify one or more key columns. The rows are grouped by the unique combinations of key values, and one record is returned from each group.
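
Roughly equivalent Spark code, with customer_id as an assumed key column:

    // Return one row for each distinct value of the key column(s).
    val distinctRows = df.dropDuplicates("customer_id")
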
Modeling nodes
Logistic Regression
Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range.
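
To illustrate what the node fits, here is a minimal Spark ML sketch; it assumes a training DataFrame whose predictors are already assembled into a vector column named features, with a categorical target in label:

    import org.apache.spark.ml.classification.LogisticRegression

    // Fit a logistic regression classifier on the training data.
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(10)
    val model = lr.fit(training)
    // Score new data; adds prediction and probability columns.
    val scored = model.transform(test)
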
Decision Tree Classifier
Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both binary and multiclass labels, as well as both continuous and categorical features.
Random Forest Classifier
Constructs multiple decision trees and outputs the label that is the mode of the labels that the individual trees predict. It supports both binary and multiclass labels, as well as both continuous and categorical features.
Gradient Boosted Tree Classifier
Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, and both continuous and categorical features.
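
For orientation, the three tree-based classifiers above correspond to the following Spark ML estimators; this sketch assumes a training DataFrame with label and vector-valued features columns:

    import org.apache.spark.ml.classification.{
      DecisionTreeClassifier, GBTClassifier, RandomForestClassifier}

    // A single tree, an ensemble that votes, and a boosted ensemble.
    val tree   = new DecisionTreeClassifier()
    val forest = new RandomForestClassifier().setNumTrees(20)
    val gbt    = new GBTClassifier().setMaxIter(10) // binary labels only
    val forestModel = forest.fit(training)
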
Linear Regression
Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
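
A minimal Spark ML sketch of the same idea, with assumed column names:

    import org.apache.spark.ml.regression.LinearRegression

    // Predict a continuous target from a vector of predictors.
    val linReg = new LinearRegression()
      .setLabelCol("target")
      .setFeaturesCol("features")
    val linModel = linReg.fit(training)
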
Generalized Linear Regression
Generalization of ordinary linear regression that allows for target values that have error distribution models other than a normal distribution.
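
In Spark ML, the error distribution (family) and link function are configurable; a hedged sketch, with a Poisson family chosen purely as an example:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    // A Poisson model with a log link, suitable for count-valued targets.
    val glr = new GeneralizedLinearRegression()
      .setFamily("poisson")
      .setLink("log")
    val glrModel = glr.fit(training)
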
Decision Tree Regressor
Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both continuous and categorical features.
Isotonic Regression
Models the isotonic relationship of a sequence of observations by fitting a free-form line to the observations under the following constraints: the fitted free-form line must be non-decreasing everywhere, and it must lie as close to the observations as possible.
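
A minimal Spark ML sketch, with assumed column names:

    import org.apache.spark.ml.regression.IsotonicRegression

    // Fit a non-decreasing free-form line to the observations.
    val iso = new IsotonicRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setIsotonic(true) // true = non-decreasing, false = non-increasing
    val isoModel = iso.fit(training)
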
Export node
Data Asset Exporter
Exports the transformed data as a new data set in the same location as the source data set.
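
Conceptually, the node ends the flow by writing the transformed DataFrame out as a new data set; a rough sketch, in which the format and path are assumptions rather than what the node actually emits:

    // Persist the transformed data as a new data set.
    transformed.write
      .mode("overwrite")
      .option("header", "true")
      .csv("/project/data/transformed-output")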