Manage models on HDFS with Hadoop integration utility methods

Watson Studio Local provides Hadoop integration utility methods for managing models on HDFS: you can build models on Hadoop, where the data resides, and pull only the trained models back into Watson Studio Local for saving, without moving the data itself. For usage details, see the sample notebooks.

dsx_core_utils

Use the following utility methods in a Watson Studio Local session and notebook.

def get_hdfs_model_info(webhdfsurl, model_name, version=-1, source_hdfs_dir=None):
    """
    Return basic metadata about the serialized model (on HDFS) for the
    specified model name, if a serialized model can be found.

    :param webhdfsurl: Web HDFS URL for a remote Hadoop system on which to
           operate.

    :param model_name: Name of an **HDFS** model on which to operate. This
           name is evaluated w.r.t. the source HDFS dir (see below) and can
           therefore be a relative path.

    :param version: (Optional) Remote model version on which to operate. If
           not specified, defaults to latest version.

    :param source_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :returns: A metadata object (JSON / dict) holding basic information about
           the serialized model object, if one was found.
    """

def load_model_from_hdfs(webhdfsurl, model_name,
    version=-1, source_hdfs_dir=None, model_load_func=None):
    """
    Find the serialized model (on HDFS) for the specified model name and, if
    it exists, read the model into memory.

    :param webhdfsurl: Web HDFS URL for a remote Hadoop system on which to
           operate.

    :param model_name: Name of an **HDFS** model on which to operate. This
           name is evaluated w.r.t. the source HDFS dir (see below) and can
           therefore be a relative path.

    :param version: (Optional) Remote model version on which to operate. If
           not specified, defaults to latest version.

    :param source_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :param model_load_func: (Optional) User-specified function to use for
           loading the model into memory. If specified the function must
           be defined to accept a single argument, **`staging_path`**,
           which is a temporary local path from which the model can be loaded.
           If not specified, a default model load function will be used based
           on the model type stored in the HDFS model metadata.

    :returns: An in-memory object representing the model that was deserialized
           from a path on HDFS.
    """

def write_model_to_hdfs(webhdfsurl, model, model_name,
    target_hdfs_dir=None, with_new_ver=True, model_write_func=None):
    """
    Serialize the specified model to a location on HDFS and correlate that
    serialized model with the specified model name.

    :param webhdfsurl: Web HDFS URL for a remote Hadoop system on which to
           operate.

    :param model: In-memory model on which to operate.
    :param model_name: Name of an **HDFS** model on which to operate. This
           name is evaluated w.r.t. the target HDFS dir (see below) and can
           therefore be a relative path.

    :param target_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :param with_new_ver: (Optional) Whether or not to write the model with
           a new version number. Defaults to `True`.

    :param model_write_func: (Optional) User-specified function to use for
           writing the model to disk. If specified the function must be
           defined to accept two arguments: **`model`** (the in-memory object
           to be serialized) and **`staging_path`** (a temporary local path
           to which the object *must* be serialized). If not specified, a
           default model write function will be used based on the Python type
           of the given in-memory model.

    :returns: A metadata object (JSON / dict) holding basic information about
           the serialized model object, if it was serialized successfully.
    """

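For instance, a model trained locally in the notebook can be written to the default HDFS model directory as a new version. This is a sketch that assumes the default write function can serialize a scikit-learn estimator; the URL and model name are placeholders.

import dsx_core_utils
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a small stand-in model locally.
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10).fit(X, y)

# No target_hdfs_dir is given, so /user/<user>/.dsxhi/models/ is used;
# with_new_ver defaults to True, so a new version number is assigned.
meta = dsx_core_utils.write_model_to_hdfs(
    "https://my-hadoop-gateway:8443/gateway/dsx/webhdfs/v1",   # placeholder URL
    model,
    "iris-rf-model")
print(meta)
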
def push_saved_model_to_hdfs(webhdfsurl, saved_model_name,
    saved_model_version=-1, target_hdfs_dir=None, with_new_ver=True):
    """
    Push (copy) a model that has already been saved locally (e.g., via
    dsx_ml.save(...)) to a location on the specified remote HDFS.

    :param webhdfsurl: Web HDFS URL for a remote Hadoop system on which to
           operate.

    :param saved_model_name: Name of a local, **already-saved** model that exists
           within the **current** project.

    :param saved_model_version: (Optional) Version of the saved model on which
           to operate. If not specified, defaults to latest version.

    :param target_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :param with_new_ver: (Optional) Whether or not to write the model with
           a new version number. Defaults to `True`.

    :returns: A metadata object (JSON / dict) holding basic information about
           the newly-pushed model that was written to HDFS.
    """

hi_core_utils

Use the following utility methods in a remote Livy session.

def get_hdfs_model_info(model_name, version=-1, source_hdfs_dir=None):
    """
    Return basic metadata about the serialized model (on HDFS) for the
    specified model name, if a serialized model can be found.

    :param model_name: Name of an **HDFS** model on which to operate. This
           name is evaluated w.r.t. the source HDFS dir (see below) and can
           therefore be a relative path.

    :param version: (Optional) Remote model version on which to operate. If
           not specified, defaults to latest version.

    :param source_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :returns: A metadata object (JSON / dict) holding basic information about
           the serialized model object, if one was found.
    """
    return hi_util.get_hdfs_model_info(model_name, version, source_hdfs_dir)
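
For example, from a Watson Studio Local notebook connected to the cluster through sparkmagic, the call can be submitted to the remote Livy session with the %%spark cell magic. This is a sketch: hi_core_utils is assumed to be importable on the cluster side, and no WebHDFS URL is needed because the code already runs on the cluster.

%%spark
import hi_core_utils

# Runs remotely inside the Livy session; latest version by default.
info = hi_core_utils.get_hdfs_model_info("churn-rf-model")   # placeholder name
print(info)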

def load_model_from_hdfs(model_name,
    version=-1, source_hdfs_dir=None, model_load_func=None):
    """
    Find the serialized model (on HDFS) for the specified model name and, if
    it exists, read the model into memory.

    :param model_name: Name of an **HDFS** model on which to operate. This
           name is evaluated w.r.t. the source HDFS dir (see below) and can
           therefore be a relative path.

    :param version: (Optional) Remote model version on which to operate. If
           not specified, defaults to latest version.

    :param source_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :param model_load_func: (Optional) User-specified function to use for
           loading the model into memory. If specified the function must
           be defined to accept two arguments: **`hdfs_path`** and
           **`staging_path`**.  The arguments point to the serialized model
           on HDFS and to a copy on the local file system of the Hadoop "driver"
           node, respectively.  The user-given function can use whichever of
           those paths is appropriate for its operation / model type. If not
           specified, a default model load function will be used based on the
           model type stored in the HDFS model metadata.

    :returns: An in-memory object representing the model that was deserialized
           from a path on HDFS.
    """

def write_model_to_hdfs(model, model_name, target_hdfs_dir=None,
    with_new_ver=True, model_write_func=None):
    """
    Serialize the specified model to a location on HDFS and correlate that
    serialized model with the specified model name.

    :param model: In-memory model on which to operate.
    :param model_name: Name of an **HDFS** model on which to operate. This
           name is evaluated w.r.t. the target HDFS dir (see below) and can
           therefore be a relative path.

    :param target_hdfs_dir: (Optional) HDFS directory in which to operate.
           If not specified, defaults to `/user/<user>/.dsxhi/models/`.

    :param with_new_ver: (Optional) Whether or not to write the model with
           a new version number. Defaults to `True`.

    :param model_write_func: (Optional) User-specified function to use for
           writing the model to disk. If specified the function must be
           defined to accept three arguments: **`model`**, **`hdfs_path`**
           and **`staging_path`**.  The first argument points to the in-memory
           object to be written.  The 2nd and 3rd arguments hold a remote
           (HDFS) target path and a local (on the Hadoop "driver" node) path,
           respectively.  The user-given function can use whichever **one** of
           those paths is appropriate for its operation / model type.  (If
           both paths are used, the `staging_path` will take precedence; the
           `hdfs_path` will be ignored.)  If not specified, a default model
           write function will be used based on the Python type of the given
           in-memory model.

    :returns: A metadata object (JSON / dict) holding basic information about
           the serialized model object, if it was serialized successfully.
    """

def run_command(command, sleep_after=None, echo_output=True):
    """
    Execute a specified command using Popen, wait for the command to complete,
    then optionally echo the command output (via "print()") before returning it.

    :param command: System command to execute within the YARN container of
           the Hadoop "driver" node.

    :param sleep_after: (Optional) Amount of time, in seconds, to wait after
           executing the command. Defaults to `None` (i.e. don't wait).

    :param echo_output: (Optional) Whether or not to echo (via "print") the
           output of the command, if there is any.  Defaults to `True`.

    :returns: The output from the executed command, as a string, or None if
           there was no output (or if the output was an empty string).
    """