Jane Uyvova gave a talk on analytics using MongoDB and the R statistical programming language. She began by distinguishing analytics from data insight. R has become a standard for data analysis thanks to its open-source nature and permissive licensing compared with legacy tools such as SAS or SPSS. She highlighted several common analytics use cases:
- Churn Analysis
- Fraud Detection
- Sentiment Analysis
Use Case 1: Genomics
The human genome consists of roughly three billion base pairs. The dataset used came from the HapMap project.
- HapMap3 was the dataset
- Bioconductor was the R library used for the analysis
- RStudio was the development environment
- The mongolite connector was used to read the data from MongoDB into R
The MongoDB data aggregation framework was used to aggregate the data by region.
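The talk did not show the actual pipeline. As an illustrative sketch only (written with pymongo rather than R's mongolite, and with hypothetical field names such as `region` and `maf` and a hypothetical `variants` collection), a group-by-region aggregation might look like:

```python
# Illustrative sketch: the "variants" collection and the "region"/"maf"
# field names are assumptions, not from the talk.
pipeline = [
    # Group variant documents by genomic region and summarize them
    {"$group": {
        "_id": "$region",                 # group key: the region field
        "variant_count": {"$sum": 1},     # number of variants per region
        "mean_maf": {"$avg": "$maf"},     # average minor-allele frequency
    }},
    {"$sort": {"variant_count": -1}},     # largest regions first
]

# Against a live server this would run as:
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# results = list(client["hapmap"]["variants"].aggregate(pipeline))

print(pipeline[0]["$group"]["_id"])
```

The same pipeline document can be passed verbatim to mongolite's `$aggregate()` in R, since both drivers accept the server-side aggregation syntax.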
In doing genomic analysis, schema design becomes important in making the analysis easier and more effective.
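The talk did not present a concrete schema; as a hypothetical sketch of the trade-off, two common layouts for genomic data are one document per variant with embedded genotypes, or one document per observation (all names below are made up):

```python
# Hypothetical schema sketch: neither layout nor any field name is from the talk.

# Option A: one document per variant, genotype calls embedded per sample.
# Compact, and a region-level aggregation touches few documents.
variant_doc = {
    "_id": "rs123456",
    "chromosome": "1",
    "position": 1_234_567,
    "region": "1p36",
    "genotypes": {"NA18501": "AA", "NA18502": "AG"},  # sample_id -> call
}

# Option B: one document per (sample, variant) observation.
# Easier to shard and filter per sample, but many more documents.
observation_doc = {
    "sample_id": "NA18501",
    "variant_id": "rs123456",
    "region": "1p36",
    "genotype": "AA",
}

print(len(variant_doc["genotypes"]))
```

Which layout makes analysis "easier and more effective" depends on the dominant query: per-region summaries favor Option A, per-sample extracts favor Option B.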
Use Case 2: Vehicle Situational Awareness
- Chicago open data was used as the dataset
- The dataset was loaded into MongoDB and Compass was used for the initial analysis
- R was used to analyze the data, including extracting data for a density plot built with ggplot2
- The MongoDB flexible schema allows a wide variety of data to be included in the analysis
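The talk did this extract-then-plot step with mongolite and ggplot2 in R. As a rough analogue (sketched in Python, with a hypothetical `crashes` collection and `latitude`/`longitude` fields standing in for whichever Chicago open-data fields were actually used):

```python
# Hypothetical sketch: collection and field names are assumptions.
# Pull only the coordinates needed for a density plot, skipping null values.
query = {"latitude": {"$ne": None}, "longitude": {"$ne": None}}
projection = {"latitude": 1, "longitude": 1, "_id": 0}

# With a live server and a plotting library this would continue:
# from pymongo import MongoClient
# docs = list(MongoClient()["chicago"]["crashes"].find(query, projection))
# lons = [d["longitude"] for d in docs]
# lats = [d["latitude"] for d in docs]
# ...then feed lons/lats to a 2-D density plot.

print(sorted(projection))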
One issue that must be addressed is scalability. Because R is effectively single-threaded, data scientists run up against data-volume constraints. One solution is to use Spark to parallelize and scale R workloads.
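In practice Spark would be driven from R (for example via SparkR or sparklyr) and the work distributed across a cluster. As a minimal single-machine stand-in for that split/apply/combine idea, using Python's multiprocessing instead of Spark (the data and the summary function are made up):

```python
# Stand-in for Spark's parallelism: partition the data, summarize each
# partition in a separate worker process, then reduce the partial results.
from multiprocessing import Pool

def summarize(chunk):
    # Per-partition work: here just a sum; in practice a model fit or stats.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000))
    chunks = [data[i::4] for i in range(4)]      # 4 partitions
    with Pool(4) as pool:
        partials = pool.map(summarize, chunks)   # parallel map per partition
    total = sum(partials)                        # reduce step
    print(total)  # → 499500
```

Spark applies the same pattern, but with partitions spread over many machines and the MongoDB connector feeding them directly from the cluster.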
A MongoDB/Spark architecture can include an operational component, consisting of the application code and a MongoDB driver, alongside a data management component, the MongoDB cluster itself.
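One way to wire the Spark side to the MongoDB cluster is the MongoDB Spark Connector. A `spark-defaults.conf`-style sketch is below; the host, database, and collection names are placeholders, and the exact option keys depend on the connector version (these follow the 10.x naming):

```properties
# Placeholder URIs: replace host, database, and collection with real values.
spark.mongodb.read.connection.uri   mongodb://localhost:27017/chicago.crashes
spark.mongodb.write.connection.uri  mongodb://localhost:27017/chicago.results
```

With this in place, Spark jobs read partitions directly from the cluster, so the operational application and the analytical workload share one data store.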