Extract features from text data in a Jupyter notebook

To use text data in a machine learning model, you must first convert the unstructured text into numeric values. You can do this with the Watson™ Explorer Feature Extractor API, which generates a vector of terms from text by using a Watson Explorer collection and its features. The API accepts a Spark DataFrame as input and appends a column of vector values for each feature.

For more information on the API, see Watson Explorer Python API.
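
For example, the basic pattern looks like the following minimal sketch. The collection ID, the input DataFrame df, and its text column body_text are placeholders for illustration only; the FeatureExtractor, OutputColumn, and transform calls are the same ones that are used in the steps that follow.

from ibmwex.ml import FeatureExtractor, OutputColumn

# df is assumed to be an existing Spark DataFrame that contains
# a free-text column named "body_text" (placeholder name).
extractor = FeatureExtractor() \
    .setCollectionId("<COLLECTION_ID>") \
    .setInputCol("body_text") \
    .setOutputCols(OutputColumn("words", "._word"))

# transform() returns a new DataFrame with a "words" vector column
# appended to the original columns.
df_with_features = extractor.transform(df)
df_with_features.printSchema()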

Create a sample notebook

Load a dataset into Spark and insert a dataset code snippet.

  1. Create a Jupyter notebook.
  2. Click the Find and Add Data icon in the toolbar, and then select the Local tab or the Remote tab.
  3. Select Insert to code for the dataset that you want to use.
  4. Select Insert Spark DataFrame in Python. The inserted code snippet loads the dataset into a Spark DataFrame, for example:
    # Add asset from file system
    df_data_1 = SQLContext(sc).read.csv(os.environ['DSX_PROJECT_DIR']+'/datasets/sample_data.csv', header='true', inferSchema = 'true')
    df_data_1.show(5)
  5. Insert a code snippet for the Watson Explorer Feature Extractor API and configure it, as described in the next steps.
  6. Click the Find and Add Data icon in the toolbar again, and then select the Other tab.
  7. A menu is displayed for each Watson Explorer collection. Select Insert to code for the collection that you want to use, and then select Insert Watson Explorer Feature Extraction.
  8. In the inserted code snippet, replace <INPUT_COLUMN_NAME> with the name of the column that contains the input text. For other parameters, see Watson Explorer Python API.
    # The insert-to-code snippet for the WEX collection must be run in an earlier cell.
    # Inserted code for WEX feature extraction:
    from ibmwex.ml import FeatureExtractor, OutputColumn
    extractor_1 = FeatureExtractor() \
        .setCollectionId("00000000-0000-0000-0000-000000000000") \
        .setInputCol("<INPUT_COLUMN_NAME>") \
        .setOutputCols(OutputColumn("words", "._word"))
  9. Use extractor_1 to extract features from the input column of the dataset. The following example displays 5 rows of the transformed DataFrame, which contains a "words" column that stores the feature vector (see also the sketch after this list).
    extractor_1.transform(df_data_1).show(5)
  10. You can now treat the text data as structured data. You can also train and save a machine learning model that uses the Watson Explorer Feature Extractor API, and deploy it to Model Management and Deployment as part of model development.
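
Before you build a full pipeline, you can quickly verify what the extractor appends. The following minimal sketch reuses extractor_1 and df_data_1 from the steps above; <INPUT_COLUMN_NAME> is the same placeholder as in the inserted snippet.

# Apply the extractor and keep the transformed DataFrame
transformed = extractor_1.transform(df_data_1)

# The schema now contains the appended "words" vector column
transformed.printSchema()

# Show the feature vectors next to the original text column
transformed.select("<INPUT_COLUMN_NAME>", "words").show(5, truncate=False)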

The following example code builds a classification pipeline with the extractor, evaluates it, and saves the trained model.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, IndexToString
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Prepare data
train, test = df_data_1.randomSplit([0.8, 0.2], 12345)

# Prepare pipeline
productIndexer = StringIndexer(inputCol="claim_product", outputCol="claim_product_index")
productEncoder = OneHotEncoder(inputCol=productIndexer.getOutputCol(), outputCol="claim_product_vector")
productLineIndexer = StringIndexer(inputCol="claim_product_line", outputCol="claim_product_line_index")
productLineEncoder = OneHotEncoder(inputCol=productLineIndexer.getOutputCol(), outputCol="claim_product_line_vector")
# wordCol and phraseCol are the OutputColumn definitions (for example,
# OutputColumn("words", "._word")) that were passed to extractor_1.setOutputCols()
assembler = VectorAssembler( \
    inputCols=[productEncoder.getOutputCol(), productLineEncoder.getOutputCol(), wordCol.getName(), phraseCol.getName()],
    outputCol="features"
)

label = StringIndexer(inputCol="label", outputCol="label_index", handleInvalid="skip")
labelModel = label.fit(train)

# Create ML model for classification
classifier = NaiveBayes(labelCol=label.getOutputCol(), featuresCol=assembler.getOutputCol())
labelDecoder = IndexToString(inputCol=classifier.getPredictionCol(), outputCol="prediction_label", labels=labelModel.labels)

pipeline = Pipeline(stages=[extractor_1, label, productIndexer, productEncoder, productLineIndexer, productLineEncoder, assembler, classifier, labelDecoder])
model = pipeline.fit(train)


# Evaluate the model
predicted = model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol=classifier.getLabelCol(), predictionCol=classifier.getPredictionCol(), metricName="accuracy")
accuracy = evaluator.evaluate(predicted)
print("Accuracy:%g" % accuracy )

predicted.select(label.getInputCol(), labelDecoder.getOutputCol()).toPandas()[0:10]

# Save the trained model
from dsx_ml.ml import save

save(name = 'WEXClassificationModel',
     model = model,
     test_data = test,
     algorithm_type = 'Classification',
     source = 'ing+Watson+Explorer+for+Classification.ipynb',
     description = 'Document classification using WEX Feature Extractor'
    )
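
As an optional refinement before you save a final model, you can tune the classifier with cross-validation. The following is a sketch only: it reuses the pipeline, classifier, evaluator, and train objects from the example above, and the smoothing values in the grid are illustrative, not recommendations.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Build a small grid over the Naive Bayes smoothing parameter
paramGrid = ParamGridBuilder() \
    .addGrid(classifier.smoothing, [0.5, 1.0, 2.0]) \
    .build()

# 3-fold cross-validation with the accuracy evaluator defined above
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)
cvModel = cv.fit(train)

print("Best cross-validated accuracy: %g" % max(cvModel.avgMetrics))
# cvModel.bestModel is a fitted PipelineModel that can be saved in the same way as above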