Table of contents

Refine data in DSX Local (Beta)

Before you analyze local data sets in DSX Local, you can refine the data in the following ways:

  • Cleanse the data: fixing or removing data that is incorrect, incomplete, improperly formatted, or duplicated.
  • Shape the data: customize data by filtering, sorting, combining, or removing columns, and performing operations.

You can refine the data in real-time to build a customized data flow, and then save the data flow as either a separate JSON file or an R script.

In the Data Refinery, you can perform the following tasks:

Create a data flow

To build a new data flow that refines your data, complete the following steps:

  1. Go to your project assets page and click Refine next to the local data set.

  2. Click the Data tab. Note that for large data sets, you will only see a subset of the rows even though the operations will run on the entire data set.

    Data Refinery

  3. Click the github_commitproject icon in a column and select an operation to cleanse, shape, or enrich the column. Alternatively, click Operations to browse operation categories or search for a specific operation. You can also manually enter R code in the command line, and let autocomplete assist you with the syntax.

  4. Each operation you add to the data flow appears as a step under the Steps pane. You can click each step for a snapshot of how the operation will refine the data. You can also undo, redo, edit, or delete steps. To insert an operation between two existing steps, click the step before the position and then apply the new operation.

  5. To save the data flow as a separate JSON file in the Data Flows section of your assets page, click the Save Data Flow icon and click Save Data Flow. To save the flow as an R script in the Scripts section of your assets page, click the click the Save Data Flow icon and click Save R Script.

Frequently used operations

Calculate
Perform a calculation with another column or with a specified value. The operators are:
  • Addition
  • Subtraction
  • Multiplication
  • Division
  • Modulus
  • Exponentiation
  • Is equal to
  • Is not equal to
  • Is less than
  • Is less than or equal to
Convert column type
Convert the type of the column to a different data type. You can create a new column to hold the result of this operation or you can overwrite the existing column values.

Tip: A column's data type determines the operations you can perform. Changing the data type can affect which operations are relevant for that column.

Filter
Filter rows by the selected columns. Keep rows with the selected column values; filter out all other rows.

The filter operators for numeric, string, and Boolean (logical) columns are:

Operator Numeric String Boolean
Is between two numbers
Is empty
Is not empty
Is equal to
Is not equal to
Is less than
Is less than or equal to
Is greater than
Is greater than or equal to
Is in
Is not in
Is null
Is not null
Math
You can apply math operations only to numeric columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.
Absolute value
Get the absolute value of a number.
Example: The absolute value of both 4 and -4 is 4.
Arc cosine
Get the arc cosine of an angle.
Ceiling
Get the nearest integer of greater value, also known as the ceiling of the number. Examples: The ceiling of 2.31 is 3. The ceiling of -2.31 is -2.
Exponentiation
Get a number raised to the power of the column value.
Floor
Get the nearest integer of lesser value, also known as the floor of the number. Example: The floor of 2.31 is 2. The floor of -2.31 is -3.
Round
Get the whole number nearest to the column value. If the column value is a whole number, return it.
Square root
Get the square root of the column value.
Remove
Remove the selected column.
Rename
Rename the selected column.
Sort ascending
Sort rows by the selected column in ascending order.
Sort descending
Sort rows by the selected column in descending order.
Substitute
Obscure sensitive information from view by substituting a random string of characters for the actual data in the selected column.
Text
You can apply text operations only to string columns. You can create a new column to hold the result of an operation or you can overwrite the existing column.
Collapse spaces
Collapse multiple, consecutive spaces in the text to a single space.
Concatenate string
Link together any string to the text. You can prepend the string to the text, append the string to the text, or both.
Lower case
Convert the text to lower case.
Number of characters
Return the number of characters in the text.
Pad characters
Pad the text with the specified string. Specify whether to pad the text on the left, right, or both the left and right.
Title case
Convert the text to title case.
Trim quotes
Remove single or double quotation marks from the text.
Trim spaces
Remove leading, trailing, and extra spaces from the text.
Upper case
Convert the text to upper case.

Cleanse operations

Convert column value to missing
Convert all column values to missing values or convert only the specified column value to a missing value.
Remove duplicates
Remove rows with duplicate column values.
Remove empty rows
Remove rows that have a blank or missing value for the selected column.
Replace missing values
Replace missing values in the column with a specified value or with the value from a specified column in the same row.
Replace substring
Replace the specified substring with the specified text.

Organize operations

Concatenate
Concatenate the values of two or more columns.
Join
Not supported.
Split column
Split the column by non-alphanumeric characters, position, pattern, or text.

Validate data

Click the Profile tab for frequency and summary statistics for each of your columns. Hover over a bar to see the data values.

Frequency is the number of times that a value, or a value in a specified range, occurs. Each frequency distribution (bar) shows the count of unique values in a column. Review the frequency distribution to find anomalies in your data. If you want to cleanse your data of those anomalies, simply remove the values.

Statistics are a collection of quantitative data. The statistics for each column show the minimum, maximum, mean, and number of unique values in that column. Depending on a column's data type, the statistics for each column will vary slightly. For example, statistics for a column of data type integer have minimum, maximum, and mean values while statistics for a column of data type string have minimum length, maximum length, and mean length values.

Visualize data

Click the Visualizations tab to input columns and view data visualizations that provide insights and uncover patterns, trends, and correlations in your data.