Machine learning models

The DSX Local client provides tools to help you create and train machine learning models that can analyze data assets and extract value from them. Users can also deploy their models to make them available to a wider audience.

DSX Local supports the following machine learning model types:

  • Spark ML
  • PMML with online scoring
  • Custom models with batch scoring
  • scikit-learn 0.19.1 (Python 2.7 and Python 3.5) - 0.19.1 (GPU-Python 3.5) with pickle or joblib format
  • XGBoost 0.7.post3 (Python 2.7 and 3.5) - 0.71 (GPU-Python 3.5)
  • Keras 2.1.3 (Python 2.7 and Python 3.5) - 2.1.5 (GPU-Python 3.5)
  • TensorFlow 1.5.0 (Python 2.7 and Python 3.5) - 1.4.1 (GPU-Python 3.5)
  • WML
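For the scikit-learn entry above, a model must be serialized in pickle or joblib format before it can be imported. A minimal sketch (file names are illustrative; note that in scikit-learn 0.19.x, joblib shipped as sklearn.externals.joblib rather than a standalone package):

```python
import pickle

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small scikit-learn model on a sample data set.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# Serialize in either of the two formats accepted for import.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)            # pickle format
joblib.dump(model, "model.joblib")   # joblib format
```

Either file can then be imported as described in "Create a model from a file" below.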

Want to see the machine learning process in action? Watch this short video:

Figure 1. Video: Machine Learning Process in Data Science Experience Local
This video illustrates the machine learning process in DSX Local 1.1.3 by showing an example of how you can create a model in a notebook and easily train, test, score, evaluate, and deploy it.

Create a model with APIs

DSX Local provides sample notebooks to help users create their own custom applications that can be powered by machine learning.

To learn more about the machine learning repository client API commands and syntax that you can use inside a notebook, see dsx_ml and dsx_core_utils.

Note that in DSX Local, the repository URL is static, and users do not need to authenticate to the repository.

Restriction: When you insert a dataframe for a model, only CSV and JSON are supported.
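As an illustration of this restriction, both supported asset types load cleanly into a pandas dataframe (the inline data here stands in for a project data asset):

```python
import io

import pandas as pd

# Inserting a dataframe for a model only supports CSV and JSON assets.
csv_data = io.StringIO("age,income\n34,52000\n41,61000")
df_csv = pd.read_csv(csv_data)

json_data = io.StringIO('[{"age": 34, "income": 52000}, {"age": 41, "income": 61000}]')
df_json = pd.read_json(json_data)

print(df_csv.shape, df_json.shape)  # both (2, 2)
```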

For each scoring iteration that you run from a notebook, DSX Local automatically increments the model version. Later, on the model details page, you can compare the accuracy of each version you ran.

Figure: Accuracy history

Create a model from a file

You can import three types of models:

PMML
An XML file written in the Predictive Model Markup Language. The PMML is scored by using the JPMML Evaluator (be sure to review its "Supported model types", "Not yet supported model types", and "Known Limitations" sections).
Custom Batch
A third-party vendor model in a compressed .gz format that performs batch scoring. If your model accesses input data from Apache Hadoop, you must adjust your script's inputs accordingly. See the Carolina for Hadoop User Guide for details. Requirement: You must package all scripts and dependent models into a single .gz file before you import it. You can use the utility script provided at /user-home/.scripts/common-helpers/batch/custom/createcustombatch.sh for this.
Custom Online
A third-party vendor model in a .jar format that performs online scoring. Requirement: You must run your scripts through the Carolina tool to generate the .jar file. See Carolina for Integration, the Carolina API docs, and the Carolina API examples for details. To perform online scoring, you need to: a) copy all of the .jar files from the Carolina-for-Integration lib folder to the /user-home/_global_/libs/ml/mlscoring/thirdparty/carolina/ folder of DSX, b) place the carolina.lic file in that same DSX folder, and c) restart the ml-scoring service.
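For the Custom Batch case, the provided createcustombatch.sh script handles packaging; conceptually, it bundles the scripts and dependent models into a single gzip-compressed archive. A rough, illustrative Python equivalent (the exact layout the import expects is defined by the helper script, so use that script in practice; all file names here are hypothetical):

```python
import tarfile
from pathlib import Path

# Hypothetical inputs: a scoring script and a dependent model file.
Path("score.py").write_text("print('scoring')\n")
Path("model.bin").write_bytes(b"\x00\x01")

# Bundle everything into one gzip-compressed archive for import.
with tarfile.open("custom_batch.tar.gz", "w:gz") as tar:
    tar.add("score.py")
    tar.add("model.bin")
```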

To import a model into your project from a file, complete the following steps:

  1. In your project, go to the Assets page and click add models. Alternatively, you can click Add model from the project pull-down menu.

  2. In the Add Model panel, click the From File tab.

  3. Specify the name and description.

  4. In the Type field, select the kind of model you are importing. Browse to the file or drag it into the box.

  5. Click the Create button.

Create a model with the model builder

Recommendation: If your model data exceeds 750 MB, use a Jupyter notebook instead to create and train the model. Otherwise, the model builder might time out during the model training.

Want to see the visual model builder in action? Watch this short video:

Figure 2. Video: Machine Learning in DSX Local
This video provides an overview of training machine learning models in DSX Local.

To create a new model in your project, complete the following steps:

  1. In your project, go to the Assets page and click add models. Alternatively, you can click Add model from the project pull-down menu.

    Figure: The project icons

  2. In the Add Model panel, click the Blank tab. Specify the name and description, and select Machine Learning. Select whether you want to create the model automatically or manually, and click Create to create an untrained model. If you opt to create the model manually, DSX Local provides a Prepare panel where you can add and configure more transformers than just the default. A transformer acts on the data, usually by appending new columns and mapping existing data to the new column.

    Figure: Create Model

  3. Click your newly created model and select the data asset to run the model on. You can also add new data assets. Click Next to load the data.

    Figure: Select Data

  4. If you selected Manual when you created the model, then add and configure each transformer accordingly. Click Next.

    Figure: Prepare

  5. Select the column value to predict and the technique to train it with (DSX Local suggests the best one). You can add estimators to train on the data and produce a model for each one; you can then select the best trained model to deploy and use for predictions. You can also adjust the validation split to experiment with how much of the data to use for training, testing, and holdout. Click Next to train and evaluate the model.

    Figure: Train

  6. Select which trained model you want to keep and click Save. Each time you save the model, its version is incremented. Later, on the model details page, you can compare the accuracy of each version.

    Figure: Accuracy history

    You can also select the best version to publish.

Test a model online

In the Assets page of your project, click Test next to the model to input data and simulate predictions on it, displayed as a pie chart or bar graph.

Figure: Test

Batch score a model

To run batch prediction jobs that read in a data set, score the data, and output the predictions in a CSV file, complete the following steps:

  1. In the Assets page of your project, click Batch Score next to the model.
  2. Select the execution type, input data set, and output data set CSV file. Restriction: WML models created in the visual model builder can only use a remote data set as the input data asset. Other types of models can use a local CSV file for the input.

    Figure: Batch score

  3. Click the Generate Batch Script button. DSX Local automatically generates a Python script that you can edit directly in the Result view. Tip: You can customize this script to pre-process your data, for example, to ensure that the case of the dataframe headers matches what the model expects.

  4. Click the Run now button to immediately create and run a job for the script. Alternatively, you can click Advanced settings to save the script as either a .py script or a .ipynb notebook in your project; then later from the Jobs page of your project, you can create a scheduled job for the script or notebook you saved with Type set to Batch scoring. Restriction: If you select a GPU worker for the job, you can only batch score Keras models.

    Figure: Create job

    Requirement: If you are evaluating a PMML model, you must add the environment variable SPARK_VERSION=2.1.

When the job runs, the output CSV file should appear under your Data sets. Click Preview next to it to view the contents. Tip: If the job reports Success but no CSV file was written, the job might have actually failed. Validate that the input table exists by using the remote data set in a notebook.

Figure: Batch output

From the job details page, you can click on each run to view results and logs. You can also view a batch scoring history from the model details.
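The generated script varies by model type, but the overall shape of a batch scoring run can be sketched as follows. This is a hedged illustration using a scikit-learn model and illustrative file names, not the exact script that DSX Local generates:

```python
import joblib
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for a previously saved model (the generated script loads yours).
X, y = load_iris(return_X_y=True)
joblib.dump(LogisticRegression(max_iter=200).fit(X, y), "model.joblib")

# Read the input data set, normalize header case (the kind of
# pre-processing the tip in step 3 describes), score, and write
# the predictions to the output CSV file.
df = pd.DataFrame(X, columns=["Sepal_L", "Sepal_W", "Petal_L", "Petal_W"])
df.columns = [c.lower() for c in df.columns]   # header-case pre-processing
model = joblib.load("model.joblib")
df["prediction"] = model.predict(df.values)
df.to_csv("predictions.csv", index=False)
```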

Evaluate a model

To evaluate the performance of a model, complete the following steps:

  1. In the Assets page of your project, click Evaluate next to the model.
  2. Select an input data set that contains the prediction column. For each evaluator, you can opt to customize your own threshold metric and specify what fraction of the overall data must be relevant for the model to be considered healthy. Note that in Spark 2.1 model evaluations, the output data set field is ignored.

    Figure: Evaluate

  3. Click the Generate Evaluation Script button. DSX Local automatically generates a Python script that you can edit directly in the Result view. Tip: You can customize this script to pre-process your data, for example, to ensure that the case of the dataframe headers matches what the model expects.

  4. Click the Run now button to immediately create and run a job for the script. Alternatively, you can click Advanced settings to save the script as either a .py script or a .ipynb notebook in your project; then later from the Jobs page of your project, you can create a scheduled job for the script or notebook you saved with Type set to Model evaluation.

    Figure: Create job

    Requirement: If you are evaluating a PMML model, you must add the environment variable SPARK_VERSION=2.1.

From the job details page, you can click on each run to view results and logs. Go to the model details page to view the evaluation history.
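The generated evaluation script also depends on the model type, but the core idea, comparing a metric against a user-chosen threshold to decide whether the model is still healthy, can be sketched like this (data, names, and threshold are illustrative, not part of the generated script):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Illustrative input data set containing a label and a prediction column.
df = pd.DataFrame({
    "label":      [0, 1, 1, 0, 1, 0, 1, 1],
    "prediction": [0, 1, 1, 0, 0, 0, 1, 1],
})

# The threshold metric: the fraction of records the model must score
# correctly for the model to be considered healthy.
THRESHOLD = 0.8
accuracy = accuracy_score(df["label"], df["prediction"])
healthy = accuracy >= THRESHOLD
print(f"accuracy={accuracy:.3f} healthy={healthy}")  # accuracy=0.875 healthy=True
```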