RStudio overview

R is a popular statistical analysis and machine-learning package that enables data management and includes tests, models, analyses, and graphics. RStudio, which is included in Data Science Experience, provides an IDE for working with R.

An RStudio session created in DSX provides 2 GB of storage and 5 GB of memory for your use.

For information about how to set up and start using RStudio, see the blog post Using RStudio in IBM Data Science Experience and the Using RStudio article on the RStudio Support site.

Restriction: To connect to relational data sources from RStudio on a DSX cluster without Internet access, the DSX administrator must copy the required packages to your RStudio pods and then install them from the downloaded package files (a minimal sketch of an offline install follows the steps below). If the DSX cluster has Internet access, complete the following steps:

  1. In the Tools shell, download the JDBC driver JAR file into the /user-home/ directory. For example:

    # confirm the current working directory
    pwd
    /user-home/1003/DSX_Projects/project-nb-test/rstudio
    # change to /user-home/1003/ and download the PostgreSQL JDBC driver
    cd /user-home/1003/
    wget https://jdbc.postgresql.org/download/postgresql-42.2.0.jar
    
  2. Configure Java on the pod:

    R CMD javareconf
    
  3. Return to the RStudio script and install the RJDBC package and its dependencies:

    install.packages("RJDBC", dependencies = TRUE)
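
If the cluster has no Internet access, the administrator-copied package files can be installed directly from disk instead of from a CRAN mirror. A minimal sketch, assuming the RJDBC source tarball was copied to /user-home/ (the file name and version are illustrative); dependencies such as DBI and rJava must be installed from local files the same way first:

# install RJDBC from a local source tarball (path and version are hypothetical)
install.packages("/user-home/RJDBC_0.2-7.tar.gz", repos = NULL, type = "source")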
    

PostgreSQL example:

library(RJDBC)

# connection details; replace these placeholder values with your own
driverClassName <- "org.postgresql.Driver"
driverPath <- "/user-home/1003/postgresql-42.2.0.jar"
url <- "jdbc:postgresql://9.876.543.21:27422/compose"
databaseUsername <- "admin"
databasePassword <- "ABCDEFGHIJKLMNOP"
databaseSchema <- "public"
databaseTable <- "cars"

# load the JDBC driver and open the connection
drv <- JDBC(driverClassName, driverPath)
conn <- dbConnect(drv, url, databaseUsername, databasePassword)

# uncomment to list the available tables and verify the connection
#dbListTables(conn)

# read the table into an R data frame
data <- dbReadTable(conn, databaseTable)
# or qualify the table name with its schema:
#data <- dbReadTable(conn, paste(databaseSchema, '.', databaseTable, sep=''))
data
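
After reading the table, you can also run arbitrary SQL through the same connection and close it when you are done. A minimal sketch using the conn object from above (the query text is illustrative):

# run an SQL query and fetch the result as a data frame (query is illustrative)
top_rows <- dbGetQuery(conn, "SELECT * FROM cars LIMIT 10")
top_rows

# close the connection when finished
dbDisconnect(conn)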

Change Spark version

sparklyr library

To use a Spark 2.2.1 service instead of the default Spark 2.0.2 service, pass the Spark 2.2.1 master URL and installation path to the spark_connect() function.

To connect to the Spark 2.2.1 cluster:

library(sparklyr)
sc <- spark_connect(master = "spark://spark-master221-svc:7077",
                    spark_home = "/usr/local/spark-2.2.1-bin-hadoop2.7")

To connect to the Spark 2.0.2 cluster:

sc <- spark_connect(master = "spark://spark-master-svc:7077")
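
With either connection, you can verify that sc works by copying a small local data frame to the cluster and querying it with dplyr verbs. A minimal sketch, assuming the connection above succeeded (the table name is illustrative):

library(sparklyr)
library(dplyr)

# copy the built-in mtcars data frame to the Spark cluster
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# run a simple aggregation on the cluster
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# disconnect when finished
spark_disconnect(sc)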

SparkR library

To use a Spark 2.2.1 service instead of the default Spark 2.0.2 service, set the SPARK_HOME environment variable to the Spark 2.2.1 installation location in RStudio:

# point SPARK_HOME at the Spark 2.2.1 installation
Sys.setenv("SPARK_HOME" = "/usr/local/spark-2.2.1-bin-hadoop2.7")
# load SparkR from the Spark 2.2.1 library path
library(SparkR, lib.loc = "/usr/local/spark-2.2.1-bin-hadoop2.7/R/lib")
# initialize the Spark session against the Spark 2.2.1 master
sc <- sparkR.session(master = "spark://spark-master221-svc:7077", appName = "dsxlRstudioSpark221")
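
To verify the session, you can convert a local data frame into a SparkDataFrame and inspect it on the cluster. A minimal sketch, assuming the session above started successfully:

# create a SparkDataFrame from the built-in faithful data set
df <- as.DataFrame(faithful)

# return the first rows, computed on the Spark cluster
head(df)

# stop the Spark session when finished
sparkR.session.stop()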

See SparkR (R on Spark) for more information.

Learn more

Read and write data to and from IBM Cloud object storage in RStudio