Complex event processing (CEP) engines are utilized for rapid and large-scale data processing in real time. Some examples of CEPs used in industry are generating online music recommendations (done by companies such as Pandora and Spotify), streaming fraud detections necessary for credit card companies and maintaining network security.
One open source CEP solution is the Apache Spark framework. Typically the speed of running programs in the Apache Spark paradigm is much faster than an equivalent MapReduce application. An increase in performance is obtained by leveraging computations in-memory. For example, Spotify generates streaming music recommendations on terabytes of data for their customers daily. Moving to the Spark ecosystem from the Hadoop+MapReduce paradigm resulted in a five-time increase in the speed of generating the recommendations, helping the company to improve the personalization of the music listened to by the user.1
With the explosion of data and the growing need for understanding the relationships that exist among the data points, Spark is increasingly being leveraged by the scientific community. AMP Lab, a research group at UC Berkeley, developed Scalable Nucleotide Alignment Program (SNAP) in coordination with Microsoft and University of California, San Francisco. SNAP reduces the RNA sequencing time required for cancer diagnosis from days to a few hours with Spark, thus contributing to meaningful results--a notable one being saving a boy’s life with a timely diagnosis.2
Meanwhile, Spark has been used by Dr. James Freeman and his group at Howard Hughes Medical Institute (HHMI) Janelia Farm Research Campus. They have studied the behavior of neurons in zebrafish to link the reactions of the zebrafish to different stimulus with time. The study resulted in terabytes of data that varies with time. To handle the load and complexity of the data, the group used PySpark (a Python API for Spark) combined with established Python modules to explore global patterns of neural activity while using Spark streaming in a real-time processing environment during the experiment. As a result of the HHMI endeavors, their developed code module, dubbed Thunder,3 is being used by communities for applications related to the Internet of Things.
Spark is being adopted by several companies in the industry. To mention a few, Guavus4 has built its operational intelligence platform on Spark, Zoomdata5 is using SparkSQL to do business intelligence-style analytics and Graphflow6 has used Spark to build a real-time recommendation and customer intelligence platform. The Spark engine is a multi-faceted tool that provides a suite of packages to build a variety (online streaming, batch processing, machine learning, etc.) of applications. The future of CEP is looking bright and can provide the rapid insight that many companies desire at scale while filling another gap in the overall analytics framework.
For more information, contact me at email@example.com.
Sanghamitra Deb is a senior analyst for research and development at Accenture Technology Labs in San Jose. She works as a data scientist in the Data Insights group. Her primary expertise is around data munging and data modeling.
Christopher Johnson, Music Recommendations at Scale with Spark, Spotify, Spark Summit 2014.
David Patterson, Spark Meets Genomics: Helping Fight the Big C with the Big D, AMP Lab UC Berkeley, Spark Summit 2014.
Freeman et al., Mapping Brain Activity at Scale with Cluster Computing, Nature, 2014.
Eric Carr, Building Big Data Operational Intelligence Platform with Apache Spark, Guavus, Spark Summit 2014.
Justin Langseth, BI-style Analytics on Spark (without Shark) using SparkSQL and SchemaRDD, Zoomdata, Spark Summit 2014.
Nick Pentreath, Using Spark and Shark to Power a Real-time Recommendation and Customer Intelligence Platform, Graphflow, Spark Summit 2014.