Create a Watson Explorer collection
The first step in analyzing text data with Watson™ Explorer is to create a Watson Explorer collection, which contains the entire set of sources for text analysis such as natural language processing (NLP). You can create a collection by using the integrated Watson Explorer Content Miner.
On your project assets page, click Watson Explorer collections, and then click Add Watson Explorer Collection.
On the first page, provide a name and a description for your collection. The collection creation wizard guides you through the remaining steps, which are described below.
Add a data set to your collection
You can select an existing Watson Studio Local data set. In this offering, you can process up to 100,000 documents. If you specify a data set that has more than 100,000 documents, only the first 100,000 documents are imported into the collection.
- Watson Studio Local Data Sets (Local files)
- To upload CSV files to Watson Studio Local, see Access data from local files.
Existing local files are displayed as radio buttons. Select the file that you want to import into this collection. For instructions after you click Import, see Importers.
- Watson Studio Local Data Sets (Remote)
- You can use Watson Studio Local data sets that are based on remote data sources. Only IBM® Db2® and Oracle sources are supported. To define a remote data set, perform the following steps.
Existing remote data sets are displayed as radio buttons. Select the data set that you want to import into this collection. If the remote data set requires a primary key for crawling, enter it in the Primary key (optional) field.
After you click Start crawling, the data set is crawled. When the crawl completes, you can proceed to the next step.
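Because only the first 100,000 documents of a data set are imported, it can help to check the size of a CSV file before you upload it. The following is a minimal Python sketch, not part of Watson Studio Local. It assumes that each CSV data row corresponds to one document and that the file has a header row. The function names count_documents and truncate_csv are hypothetical helpers introduced for illustration.

```python
import csv
import io

MAX_DOCUMENTS = 100_000  # import limit stated in this topic


def count_documents(csv_file, has_header=True):
    """Count data rows in an open CSV file (assumption: one row = one document)."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row if present
    return sum(1 for _ in reader)


def truncate_csv(src, dst, limit=MAX_DOCUMENTS):
    """Copy the header and at most `limit` data rows from src to dst.

    Returns the number of data rows written.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:
        return 0  # empty input, nothing to copy
    writer.writerow(header)
    written = 0
    for row in reader:
        if written >= limit:
            break
        writer.writerow(row)
        written += 1
    return written


# Usage with in-memory data; for real files, open them with newline="".
sample = io.StringIO("id,body\n1,first text\n2,second text\n3,third text\n")
total = count_documents(sample)
sample.seek(0)
trimmed = io.StringIO()
kept = truncate_csv(sample, trimmed, limit=2)
print(f"{total} documents found, {kept} kept")
```

Running the check locally makes it explicit which documents are left out, rather than relying on the import to drop everything after the limit.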
Configure collection fields
Select the body field, which is typically used by applications. For advanced use, you can configure the fields further on the Configure collection page after you create the collection.
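If you are unsure which column of a CSV data set to select as the body field, one rough heuristic is to look for the column that holds the most text. The following Python sketch illustrates that idea under stated assumptions: the file has a header row, and the body is the column with the largest total character count in a sample of rows. The function suggest_body_field is a hypothetical helper, not part of the product, and its result is only a suggestion to review.

```python
import csv
import io
from collections import defaultdict


def suggest_body_field(csv_file, sample_size=1000):
    """Return the name of the column with the most text in the sampled rows.

    Returns None if the file has no data rows.
    """
    reader = csv.DictReader(csv_file)
    totals = defaultdict(int)
    for index, row in enumerate(reader):
        if index >= sample_size:
            break
        for name, value in row.items():
            # DictReader uses a None key for surplus values and None values
            # for missing ones; ignore both.
            if name is not None and isinstance(value, str):
                totals[name] += len(value)
    if not totals:
        return None
    return max(totals, key=totals.get)


# Usage with in-memory data; for real files, open them with newline="".
sample = io.StringIO(
    "id,title,body\n"
    "1,Hello,A long piece of customer feedback text\n"
    "2,Again,Another fairly long piece of text\n"
)
print(suggest_body_field(sample))
```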
Enrich your collection
Enrichment is the process of generating annotations from unstructured text content. Only existing annotations are listed here, but you can create and apply more later. The enrichments that you select here are applied to analyzable text fields (the body and title fields in typical collections).
- Select annotators to be enabled for this collection. Selected annotators enrich the body text content. The Part of Speech annotator is selected by default. For more information, see Annotators.
- Language identification
- Specify how the language of the text content is determined for the enrichment process.
Choose automatic detection or a specific language. The following languages are supported.
- Arabic, Czech, Danish, German, English, Spanish, French, Hebrew, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Turkish, Chinese
Save your collection
You can select Enable Domain Adaptation Curator to facilitate natural language processing in Watson Explorer Content Miner. For more information, see Domain Adaptation Curator.
The indexing process starts after you save the collection.