Open source search engines in the age of Cloud, Analytics, and Cognitive Search

Solr vs. Elasticsearch has been discussed frequently in our client projects and within the enterprise search community. But as traditional enterprise search has evolved into what Gartner calls "Insight Engines," we revisited this topic to provide the latest observations incorporating Cloud, Analytics, and Cognitive Search capabilities to help you evaluate Solr and Elasticsearch.

What is Solr?

Solr is a leading open source search engine from the Apache Software Foundation’s Lucene project. Thanks to its flexibility, scalability, and cost-effectiveness, Solr is widely used by large and small enterprises.

What is Elasticsearch?

Elasticsearch, also based on Lucene, is another leading open source search engine supporting powerful enterprise applications. Elastic - the company behind Elasticsearch and the Elastic Stack - provides enterprise solutions for search, log analytics, and other advanced analytics use cases.

Choosing your open source search engine

When we help clients assess an open source search engine for their enterprise solution, one question often comes up: “Which is better, Solr or Elasticsearch?” While there may be a preconceived notion that one is inherently better than the other, the more relevant question is “Which is better for me?”

There are various search engine technologies available, but the most popular open source variants are those built on Apache Lucene, the library that, in essence, makes the search engine work. Solr and Elasticsearch each sit on top of that search library and add their own implementations and features to form a complete search product. Because both share Lucene’s core, basic search functionality behaves the same across Solr and Elasticsearch; the differentiators come from the implementation approaches each takes around Lucene.

The role of a search engine has moved beyond finding information effectively: it is now a key part of content analytics, predictive modeling, and integration with cognitive/intelligent search features, such as natural language processing (NLP), machine learning (ML), and relevancy scoring. We've explored and implemented these intelligent capabilities in our client work.


Solr vs. Elasticsearch: Which is better for my organization?

Well, it depends.

There are many use cases surrounding the adoption of one technology over another. But when asked this question, I’ll typically reply with an analogy from an operational management perspective: “Solr is like Linux. Elasticsearch is like Windows.” You can heavily customize and tailor Solr to fit your needs, but management and deployment are much more involved and resource-consuming than with Elasticsearch. Elasticsearch is very easy to deploy, manage, and monitor (using X-Pack), and its well-designed user interface (Kibana) supports data exploration and the creation of analytical visualizations. Customization, however, is limited and more difficult with its plugin framework.

Elasticsearch could be for you if you:

  • Want to get your search engine up and running quickly with little to no overhead;
  • Want to begin exploring your data as soon as possible; and
  • Consider analytics and visualizations a core component of your use cases.

Solr may be for you if you:

  • Need to index and reprocess massive amounts of data on a large scale;
  • Have available resources to invest in managing Solr and the tools available for interaction; and
  • Have an existing enterprise framework that is built to work with Solr (for example, other Apache products such as Hadoop, or Hadoop-based enterprise platforms like Cloudera, Hortonworks, or HDInsight).

This is not to say that a Hadoop platform cannot work with Elasticsearch (we have proposed this scenario to clients), but some platforms, Cloudera and Hortonworks in particular, provide additional tools and methodologies for indexing data and managing Solr within the ecosystem (which is especially the case with the upcoming release of Cloudera’s CDH 6 supporting Solr 7).

 

Solr vs. Elasticsearch: Feature comparison

From experience, we've seen that assessments can provide tremendous value in helping clients define strategies and implementation roadmaps. During our assessments, we build a search engine comparison matrix that evaluates the suitability of each search engine against a particular client’s needs and use cases, applying a weighted scoring mechanism based on the priority of certain features. From this analysis, common features and use cases emerge as points of interest when making an overall recommendation for a search engine.
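To make the weighted scoring idea concrete, here is a minimal sketch in Python. It is not our actual assessment matrix: the criterion names, the weights, the 0 to 5 rating scale, and the placeholder engines `engine_a` and `engine_b` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # client-assigned priority (hypothetical 1-5 scale)


def weighted_score(criteria, ratings):
    """Return the weight-normalized score for one engine.

    `ratings` maps criterion name -> rating (hypothetical 0-5 scale).
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(c.weight * ratings[c.name] for c in criteria) / total_weight


# Hypothetical priorities and ratings, for illustration only.
criteria = [
    Criterion("ease_of_operations", 5),
    Criterion("plugin_extensibility", 2),
    Criterion("visualization", 3),
]
ratings = {
    "engine_a": {"ease_of_operations": 2, "plugin_extensibility": 5, "visualization": 2},
    "engine_b": {"ease_of_operations": 5, "plugin_extensibility": 2, "visualization": 5},
}

# One normalized score per engine; a higher score is a better fit for these priorities.
scores = {engine: weighted_score(criteria, r) for engine, r in ratings.items()}
for engine, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{engine}: {score:.2f}")
```

Changing the weights changes the ranking, which is the point: the same two engines can score very differently for clients with different priorities.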


The chart below captures some of the observations about Solr and Elasticsearch:

Use Cases

Solr:
  • Search for large bulk data sets, for example, healthcare (payer / provider), biopharma research, finance, and government
  • Native unformatted record filter and search, such as e-commerce or customer-facing search
  • Static data set searching
  • Large bulk reprocessing

Elasticsearch:
  • Log analytics: enterprise log consumption and analysis or a replacement option for commercial off-the-shelf log analytics products
  • Real-time dashboards for operational timeline or sales and marketing insights
  • High-volume data streams with natural language content from social media and IoT streams
  • Native unformatted record filter and search (e-commerce, customer)

Visualization Tools

Solr:
  • Banana (Kibana port) can provide support up to Solr 6.x
  • Apache Hue (mostly used in Hadoop deployments): emerging functionality with Hue Search App

Elasticsearch:
  • Robust visualization development framework with Kibana
  • Maintained and version-matched by Elastic
  • Well-integrated with Grafana for analytics and monitoring

Cloud and Big Data

Solr:
  • Cloud-based deployments rely heavily on management tools like Cloudera and Hortonworks
  • Fully-hosted options are available through third-party vendors
  • As an Apache project, Solr integrates well with other Apache products, especially those supported in Hadoop

Elasticsearch:
  • Fully-hosted and managed solutions are provided by all the major cloud infrastructure providers (Microsoft Azure, AWS, Google Cloud)
  • Management tools are provided by the cloud hosting provider
  • Elasticsearch Hadoop libraries allow for the integration of Hadoop components with Elasticsearch natively

Cognitive Search Capabilities and Integration

Solr:
  • Learning to Rank (LTR) module is supported in Solr 6.4 or later
  • As an Apache project, Solr integrates well with OpenNLP (but not an embedded component) for entity extraction and tagging to feed concept-based search

Elasticsearch:
  • Includes a Machine Learning component (with X-Pack)
  • Allows for pattern recognition and time series forecasting (ML and Kibana)
  • Learning to Rank (LTR) plugin supports machine-learning-driven relevancy tuning exercises
  • OpenNLP can be utilized in a similar fashion to Solr as an external component supporting cognitive search functions

Management and Operations

Solr:
  • Overall, more difficult to manage (though Cloudera Manager helps with this in a Hadoop environment)
  • APIs are not available (though Solr 7 supports metrics APIs, requires JMX)
  • Scaling requires manual intervention for shard rebalancing (Solr 7 has an auto-scaling API giving some control over shard allocation and distribution)

Elasticsearch:
  • Easy to set up and scale
  • Automatic shard rebalancing after node addition
  • APIs provide ease of monitoring and state evaluation
  • X-Pack provides out of the box resource dashboards (requires licensing from Elastic)

Development Architecture

Solr:
  • Excellent pluggable architecture
  • Plugins can be easily developed and integrated
  • Fully open source with vast community support
  • Tight integration with Lucene development

Elasticsearch:
  • More restrictive plugin architecture
  • Plugins are not supported in hosted environments
  • Recently became fully open source with Elasticsearch core and X-Pack (X-Pack code has been released as open source, but still requires commercial licensing to implement)
  • Lags slightly in implementing new Lucene features
  • Frequent point releases with feature additions

Cluster State Management

Solr:
  • ZooKeeper Quorum: minimum 3 nodes required; 5 to 7 recommended depending on the overall size of the cluster

Elasticsearch:
  • Master Nodes (proprietary solution): minimum 3 nodes required. They can exist as independent nodes or dual-role nodes with data nodes

Security

Solr:
  • Implemented in 3 flavors: basic (username/password in ZooKeeper), Hadoop authentication (LDAP), or Kerberos
  • LDAP / Active Directory is not supported directly
  • Custom plugins can be developed

Bulk Indexing Tools

Solr:
  • Batch API operations
  • Within Cloudera Hadoop: MapReduceIndexerTool (Solr 4.x); Lily HBase batch indexing; and Spark CrunchIndexerTool
  • MapReduceIndexerTool (5.x) from Lucidworks

Elasticsearch:
  • Bulk API operations only
  • Configuration modifications can be made to speed up initial bulk indexing

Near Real Time (NRT) Indexing (not a comprehensive list)
  • Beats framework
  • Logstash
  • Ingest Nodes
  • Kafka Connect Elasticsearch Sink
  • Spark Streaming
  • Apache NiFi/MiNiFi
  • Accenture Aspire for unstructured data processing and enrichment

Analytics

Solr:
  • Strong facet-based analytics
  • JSON facets added to support more dynamic aggregations with analytic functions
  • Stream Expressions are added in Solr 7 to support a streaming framework for parallel computation and result emissions for downstream processing

Elasticsearch:
  • Strong analytic capabilities with aggregations
  • Supports analysis on top of aggregations (e.g. moving averages)
  • Provides time-series analysis of continually added data (like logs or social media streams) for trend and efficacy insights

Nested Data Structures

Solr:
  • Has the notion of parent-child document relationships
  • These exist as separate documents within the index, limiting their aggregation functionality in deeply-nested data structures

Elasticsearch:
  • Deep nesting is well-supported
  • Fully-structured JSON documents can be directly persisted into Elasticsearch
  • Aggregations can be performed against nested structures easily

Query Operations

Solr:
  • Mostly limited to query URI parameters, leading to complex queries (debuggable in Solr Admin)
  • JSON API (Solr 7) introduced to allow for JSON based query expressions
  • Request handlers can be simply defined in Solr configuration and Java to perform specific and complex tasks related to a given query use case

Elasticsearch:
  • Full-featured Query DSL for writing and expressing complex queries
  • Limited to only JSON
  • Custom request handlers require the development of a plugin. There is no notion of jar references from a custom endpoint as there is in Solr

API Interaction

Solr:
  • SolrJ (Java) is the most well maintained and up-to-date version and is maintained as part of the Apache project
  • Other Apache maintained APIs: Flare, PHP, Python, Perl
  • Other language APIs exist but are community maintained, and often lag in functionality behind SolrJ (most notably the .NET API)

Elasticsearch:
  • Many APIs are developed and supported directly by Elastic (Java, JavaScript, Groovy, .NET, PHP, Perl, Python, Ruby)
  • Other community APIs exist for Elasticsearch (e.g. C++, Erlang, Go, Haskell, Lua, Perl, R)
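
To illustrate the query operations comparison in the chart, the sketch below expresses one simple search (a text match on a title, a category filter, and a count per category) in three forms: Solr URI parameters, a Solr JSON request body, and an Elasticsearch Query DSL body. The field names `title` and `category`, the facet name `by_category`, and the helper function names are hypothetical, and the sketch assumes `category` is indexed as an exact-value field. It only builds the request payloads; it does not send them to a server.

```python
import json
from urllib.parse import parse_qs, urlencode


def solr_uri_params(text, category, rows=10):
    """Solr classic style: the whole query is a set of URI parameters."""
    return urlencode({
        "q": f"title:{text}",          # main query
        "fq": f"category:{category}",  # filter query
        "rows": rows,
        "facet": "true",
        "facet.field": "category",
    })


def solr_json_request(text, category, rows=10):
    """Solr JSON request style: the same query expressed as a JSON body."""
    return {
        "query": f"title:{text}",
        "filter": [f"category:{category}"],
        "limit": rows,
        "facet": {"by_category": {"type": "terms", "field": "category"}},
    }


def elasticsearch_query_dsl(text, category, size=10):
    """Elasticsearch Query DSL: a bool query with a terms aggregation."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {"category": category}}],
            }
        },
        "aggs": {"by_category": {"terms": {"field": "category"}}},
    }


# Hypothetical single-word inputs; multi-word Solr terms would need quoting.
params = solr_uri_params("laptop", "electronics")
decoded = parse_qs(params)  # round-trip check of the encoded parameters
solr_body = solr_json_request("laptop", "electronics")
es_body = elasticsearch_query_dsl("laptop", "electronics")

print(params)
print(json.dumps(solr_body, indent=2))
print(json.dumps(es_body, indent=2))
```

The URI form stays compact for simple queries but grows hard to read as filters and facets accumulate, which is where the JSON forms tend to be easier to write and review.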

Choosing between Solr and Elasticsearch? Consider these

Deciding which search engine is best for your specific use cases and needs should not rest on an “either-or” presumption. The importance of a particular piece of functionality in Solr may outweigh an operational advantage in Elasticsearch. For example:

In one client case, the overhead associated with Solr deployment and the need to use a SolrNet client that was outdated at the time were outweighed by the pluggable nature of Solr. Custom encryption update and request handlers were needed to apply encryption to indexed content using rotating data encryption keys, which necessitated the use of Solr over Elasticsearch. The functionality required by the index encryption process could not be implemented effectively within Elasticsearch.

Conversely, when evaluating search engine options for a general search use case without big data or analytics considerations, Elasticsearch becomes a more popular option due to the reduced overhead in maintenance and deployment, as well as the options for fully-hosted and managed environments.

In some scenarios, depending on what matters most to a client, it is not immediately clear which search engine (including commercial engines) will best serve the client’s needs, even after a scoring rubric has been applied. In such cases, a “bake-off” can be performed using sample data sets, giving the client a direct evaluation of how well each engine performs for a specific set of use cases.
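
One way to put numbers on a bake-off is to run the same judged queries against each engine and compare a relevance metric. The sketch below uses precision at k, which is our choice for illustration and only one of many measures a bake-off might use. The query ids, document ids, relevance judgments, and engine result lists are all made up.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k returned document ids that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(results_by_query, judgments, k):
    """Average precision@k over all judged queries for one engine."""
    scores = [
        precision_at_k(results_by_query.get(query, []), relevant, k)
        for query, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical relevance judgments: query id -> set of relevant document ids.
judgments = {"q1": {"d1", "d2"}, "q2": {"d7"}}

# Hypothetical ranked results returned by two placeholder engines.
engine_a = {"q1": ["d1", "d3", "d2"], "q2": ["d9", "d7", "d8"]}
engine_b = {"q1": ["d3", "d4", "d1"], "q2": ["d7", "d8", "d9"]}

score_a = mean_precision_at_k(engine_a, judgments, k=2)
score_b = mean_precision_at_k(engine_b, judgments, k=2)
print(f"engine_a P@2 = {score_a:.2f}, engine_b P@2 = {score_b:.2f}")
```

A real bake-off would also weigh indexing throughput, query latency, and operational effort alongside relevance.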

At the end of the day, both Solr and Elasticsearch are powerful, flexible, scalable, and extremely capable open source search engines. Your overall use cases and business requirements, together with your desired features, operational considerations, and integrations with new cognitive search and analytics capabilities, will ultimately drive your decision to select Solr or Elasticsearch.


Jonathan Blasenak

Functional & Industry Analytics Sr. Manager – Accenture Applied Intelligence
