Spark Canvas flows add-on (Beta)
You can build a machine learning model as a flow by using the Spark Canvas to conveniently transform data, train the model, and evaluate it. For a compatibility report of data sources supported by Spark Canvas in Watson Studio Local, see Software Product Compatibility.
A Spark Canvas runtime is created for each user per project. Each runtime consumes 1 CPU and 5 GB of memory.
Install the Spark Canvas flows add-on
To install the Spark Canvas flows add-on into Watson Studio Local, complete the following steps:
- Sign in to the first master node of the Watson Studio Local cluster as root or as a user who is a member of the docker group on all nodes of the cluster.
- On the first master node, download the Spark Canvas installation package spark-canvas-1.2.1.tgz.
- Extract the contents of the Spark Canvas installation package: tar xvf spark-canvas-1.2.1.tgz.
- In the extracted Spark Canvas directory, run the installation script.
Build a model with the Spark Canvas
To build a machine learning flow by using the Spark Canvas, complete the following steps:
- Open a project in the Watson Studio Local client.
- From your project, click the Assets tab and click Modeler Flows.
- Click Add Modeler Flow.
- Click the Blank tab and select Scala Spark 2. For the Spark runtime, select either the Watson Studio Local Spark or any registered Hadoop system that has Spark 2. Alternatively, you can click the From File tab to import a preexisting .STR file.
- Name and describe your machine learning flow.
- Click the Create button. The Spark Canvas tool opens so that you can build your flow.
- Add the data from your project to the Spark Canvas. Click the Find and Add Data icon for a list of the data sets or connections to choose from.
- Open the node palette by clicking the palette icon.
- From the node palette, select a node and drag it to the Spark Canvas. See The node palette for descriptions.
- From the Spark Canvas, double-click a node to specify its properties.
- Draw a connector from the data set to the node.
- Continue to add operators or other nodes as needed to build your model.
Options for building a model
- You can run any terminal node within the Spark Canvas without running the entire model. Right-click the node and select Run.
- To view the results of an output node, such as a table node, run the node and then click the View outputs and versions icon. In the side palette, on the Outputs tab, click the object, such as a table, to open it.
The node palette
- Transformations nodes
- These transformation nodes enable you to work with data directly so that you can create data sets that are easier to manipulate. For example, you can filter out unnecessary rows or fill out a data set by adding a column of data.
- Filter Rows
- Filter rows based on criteria. Set a condition by using a Spark SQL expression to create a new data set.
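A minimal Python sketch of what a Filter Rows node does conceptually. In Spark Canvas the condition is a Spark SQL expression evaluated against a DataFrame; the list-of-dicts row format and predicate function here are illustrative assumptions, not the node's implementation.

```python
# Sketch only: mimics a Filter Rows node on plain Python rows.
rows = [
    {"name": "Ann", "age": 34},
    {"name": "Bo", "age": 19},
    {"name": "Cy", "age": 52},
]

def filter_rows(rows, predicate):
    """Return a new data set containing only the rows that satisfy the condition."""
    return [row for row in rows if predicate(row)]

# Stand-in for a Spark SQL condition such as: age > 30
adults_over_30 = filter_rows(rows, lambda r: r["age"] > 30)
print([r["name"] for r in adults_over_30])  # → ['Ann', 'Cy']
```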
- Select Columns
- Select specific columns of data for work later in the machine learning flow. Other columns are excluded from this working data set.
- Add Column
- Use the Spark SQL expression language to combine existing columns to create new columns or to create a new column of constant values.
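A hedged sketch of the Add Column idea in plain Python. In Spark Canvas the derivation is a Spark SQL expression (for example, combining two existing columns); the `add_column` helper and row format below are assumptions for illustration.

```python
# Sketch only: derive a new column from existing ones, leaving the source untouched.
rows = [
    {"price": 2.0, "quantity": 3},
    {"price": 5.0, "quantity": 1},
]

def add_column(rows, name, derive):
    """Return new rows with an extra column computed from the existing columns."""
    return [{**row, name: derive(row)} for row in rows]

# Stand-in for a Spark SQL expression such as: price * quantity
with_total = add_column(rows, "total", lambda r: r["price"] * r["quantity"])
```

A constant column works the same way: pass `lambda r: 1` as the derivation.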
- Rename Column
- Rename a column to better reflect the content or to match the same data in another table.
- Sample Rows
- Create a random sample of rows, with or without replacement, with the option to provide a limit on the sample size.
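The sampling-with-or-without-replacement distinction can be sketched with Python's standard `random` module; this is an illustration of the concept, not how the Spark Canvas node samples a distributed data set.

```python
import random

rows = list(range(100))
rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Without replacement: each row appears at most once in the sample.
sample_without = rng.sample(rows, k=10)

# With replacement: the same row may be drawn more than once.
sample_with = rng.choices(rows, k=10)

# The k argument plays the role of the optional limit on sample size.
```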
- SQL Transform
- Transform data by running an arbitrary SQL statement. The SQL Transform node can be connected to any node; it treats the incoming dataframe as a SQL table that can be queried and augmented with new fields based on the SQL that you run, and the incoming dataframe can be referenced in queries by its table name. For example, when two data sets share no key columns, a dummy key column can be fabricated in each so that a join can be constructed.
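The dummy-key join described above can be sketched with SQLite standing in for the Spark SQL engine; the table names, columns, and constant key `k` below are illustrative assumptions, not part of Spark Canvas.

```python
import sqlite3

# Two tables with no key columns in common.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics (value REAL)")
con.executemany("INSERT INTO metrics VALUES (?)", [(1.0,), (2.0,), (3.0,)])
con.execute("CREATE TABLE totals (total REAL)")
con.execute("INSERT INTO totals VALUES (6.0)")

# Fabricate the same constant dummy key in both tables and join on it,
# so every metric row can be augmented with the single total.
rows = con.execute("""
    SELECT m.value, m.value / t.total AS fraction
    FROM (SELECT 1 AS k, value FROM metrics) AS m
    JOIN (SELECT 1 AS k, total FROM totals) AS t ON m.k = t.k
""").fetchall()
print(rows)  # → [(1.0, 0.16666666666666666), (2.0, 0.3333333333333333), (3.0, 0.5)]
```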
- Select Distinct Rows
- Specify one or more key columns; the rows are grouped by unique key values, and one record is returned from each resulting group.
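The Select Distinct Rows behavior can be sketched in plain Python: group rows by the chosen key columns and keep one record per group. The keep-first rule and row format here are assumptions for illustration; they are not a statement of which record Spark Canvas keeps.

```python
# Sketch only: one record kept per unique combination of key-column values.
rows = [
    {"city": "Oslo", "year": 2020, "temp": 5},
    {"city": "Oslo", "year": 2020, "temp": 7},
    {"city": "Oslo", "year": 2021, "temp": 6},
]

def select_distinct(rows, keys):
    """Group rows by the key columns and return one record from each group."""
    seen = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        seen.setdefault(key, row)  # keep the first row seen for each key
    return list(seen.values())

distinct = select_distinct(rows, ["city", "year"])  # two groups remain
```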
- Modeling nodes
- Logistic Regression
- Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range.
- Decision Tree Classifier
- Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both binary and multiclass labels, as well as both continuous and categorical features.
- Random Forest Classifier
- Constructs multiple decision trees and produces the label that is the mode of the individual trees' predictions. It supports both binary and multiclass labels, as well as both continuous and categorical features.
- Gradient Boosted Tree Classifier
- Produces a classification prediction model in the form of an ensemble of decision trees. It supports only binary labels, and both continuous and categorical features.
- Linear Regression
- Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
- Generalized Linear Regression
- Generalization of ordinary linear regression that allows for target values that have error distribution models other than a normal distribution.
- Decision Tree Regressor
- Maps observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It supports both continuous and categorical features.
- Isotonic Regression
- Models the isotonic relationship of a sequence of observations by fitting a free-form line to the observations under the following constraints: the fitted free-form line must be non-decreasing everywhere, and it must lie as close to the observations as possible.
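The non-decreasing, least-squares fit described above is the essence of the pool-adjacent-violators (PAV) algorithm. The following is a minimal Python sketch of that idea, not Spark's implementation: whenever a later block's mean falls below an earlier one's, the two blocks are pooled into their common mean.

```python
def isotonic_fit(y):
    """Fit a non-decreasing sequence to y, minimizing squared error (PAV sketch)."""
    # Each block holds [sum of values, count]; its mean is sum / count.
    blocks = []
    for value in y:
        blocks.append([value, 1])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every point in a block gets the block mean
    return fitted

print(isotonic_fit([1, 3, 2, 4]))  # → [1.0, 2.5, 2.5, 4.0]  (3, 2 pooled to 2.5)
```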
- Export node
- Data Asset Exporter
- Exports the transformed data as a new data set in the same location as the source data set.