MongoDB World 2017: Using R for Advanced Analytics with MongoDB

Jane Uyvova gave a talk on analytics using MongoDB and the R statistical programming language.  She began her talk by discussing analytics versus data insight.  R has become a standard tool for analyzing data because it is open source and has simpler licensing than some legacy tools, such as SAS or SPSS.

Use Cases

  • Churn Analysis
  • Fraud Detection
  • Sentiment Analysis
  • Genomics

Use Case 1: Genomics

The human genome consists of roughly three billion base pairs.  The dataset that was used came from the HapMap project.

  • HapMap3 was the dataset
  • Bioconductor provided the R packages that were used for this analysis
  • RStudio was the environment used for the analysis
  • The mongolite package connected R to MongoDB

The MongoDB aggregation framework was used to aggregate the data by region.
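
As a rough illustration of grouping by region, the sketch below builds a two-stage aggregation pipeline in Python.  The field name "region" and the database and collection names are assumptions made for the example, not the schema shown in the talk.  The pipeline is plain data, so it can be built and inspected without a running server.

```python
import json

def build_region_pipeline(region_field="region"):
    """Return an aggregation pipeline that counts documents per region.

    The field name is a hypothetical placeholder, not the talk's schema.
    """
    return [
        # $group collects documents that share the same region value
        {"$group": {"_id": f"${region_field}", "samples": {"$sum": 1}}},
        # $sort orders the groups from the largest count to the smallest
        {"$sort": {"samples": -1}},
    ]

pipeline = build_region_pipeline()

# The same pipeline serialized as JSON, the form a string-based client expects
pipeline_json = json.dumps(pipeline)

# With a running MongoDB and the pymongo driver, it could be executed as:
#   from pymongo import MongoClient
#   rows = MongoClient()["genomics"]["hapmap"].aggregate(pipeline)
# ("genomics" and "hapmap" are made-up database and collection names.)
```

In R, mongolite's aggregate method accepts the pipeline as a JSON string, so the serialized form is the one that would carry over.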

In genomic analysis, schema design is important: a well-chosen document structure makes the analysis easier and more effective.

Use Case 2:  Vehicle Situational Awareness

  • Chicago open data was used as the dataset
  • The dataset was loaded into MongoDB and Compass was used for the initial analysis
  • R was used to analyze the data and to extract the values for a density plot drawn with ggplot2
  • The MongoDB flexible schema allows a wide variety of data to be included in the analysis
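
To make the density plot step concrete, here is a minimal sketch of a Gaussian kernel density estimate, which is the kind of curve a density plot draws.  It assumes a one-dimensional density over a single numeric field; the speed values and the bandwidth are made up for illustration and do not come from the Chicago dataset.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the samples."""
    n = len(samples)
    # Normalizing constant so the estimated curve integrates to 1
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Made-up speeds in mph, purely illustrative
speeds = [22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 45.0]
density = gaussian_kde(speeds, bandwidth=3.0)

# Evaluate the curve on a grid from 10 to 60 mph in 0.5 mph steps,
# as a plotting library would before drawing the line
grid = [10.0 + 0.5 * i for i in range(101)]
curve = [density(x) for x in grid]
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, so the choice affects how the plot reads.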

One issue that must be addressed is scalability.  Because R is single-threaded by default and holds its working data in memory, data scientists come up against data volume constraints.  One solution to this is to use Spark to parallelize and scale R workloads across a cluster.
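
The idea behind that kind of scaling can be sketched without Spark: split the data into partitions, summarize each partition independently, and combine the summaries.  The example below uses a Python thread pool only as a stand-in for cluster workers; it shows the pattern, not Spark's API, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(values, parts):
    """Split values into at most `parts` roughly equal slices."""
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def partial_stats(chunk):
    """Summarize one partition as (count, sum)."""
    return len(chunk), sum(chunk)

def parallel_mean(values, parts=4):
    """Compute the mean by combining per-partition summaries."""
    if not values:
        raise ValueError("values must not be empty")
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # Each worker sees only its own slice of the data
        summaries = list(pool.map(partial_stats, partition(values, parts)))
    count = sum(c for c, _ in summaries)
    total = sum(s for _, s in summaries)
    return total / count

values = list(range(1, 1001))
mean = parallel_mean(values)
```

Only the small (count, sum) pairs are combined at the end, which is why this style of computation can work on data too large for one process to hold.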

A MongoDB/Spark architecture can include an operational component, which consists of an application cube and a MongoDB driver.  The data management component consists of the MongoDB cluster.