Google Cloud Search: First-hand review and analysis
August 6, 2019
In 2002, Google entered the enterprise search space with the introduction of the Google Search Appliance (GSA) - a rack-mounted device that provided document indexing and search functionality. Over time, the GSA gained popularity thanks to a number of powerful features:
As newer technology trends and features were introduced to the market, competitors of the GSA’s search solution started to attract more users. Here are some trends we have spotted over the past few years:
Fast-forward to 2016 when Google announced the GSA’s end-of-life (the last GSA license renewals end in the spring of 2019) and its replacement with a new cloud offering: Google Cloud Search.
Cloud Search was initially rolled out as a G Suite service with the ability to search content from a number of Google services: Gmail, Drive, Calendar, Sites, People, and Groups. The next major Cloud Search release, announced at Google Cloud Next 2018, is a standalone service that adds support for indexing and searching third-party enterprise content, such as:
While Cloud Search seamlessly integrates third-party and G Suite content, it’s important to note that enterprises will not need to have a G Suite license in order to use Cloud Search – Cloud Search customers can choose to index only third-party content and make it available within their enterprise search solutions. See Cloud Search documentation.
Over the last eight months, Google has offered early access to select partners with significant enterprise search experience and to customers willing to try Cloud Search’s early alpha/beta versions with third-party connectivity. Accenture has been working with Google on the roll-out. We had a front-row seat in learning about the Cloud Search features, performing initial customer deployments, providing feedback, and influencing some of the future evolution of Cloud Search.
Below are our takeaways and analysis of some of the main Cloud Search features. They can serve as initial guidelines for companies looking to leverage Cloud Search, whether it is to replace the GSA, to get started on their first search initiative, or to improve their current search experience.
1. CONTENT INDEXING
a. Data organization
Content is organized inside Cloud Search in data sources (collections of items that share the same schema). A schema is a blueprint of the type of objects that will be indexed inside a data source. It allows a user to define an object’s properties that hold custom metadata and their types (e.g. text, date, timestamp, integer, double, Boolean, etc.). The properties’ definitions include other characteristics that indicate if the field is returnable, facetable, sortable, etc.
The most noteworthy among a property’s characteristics is a capability unique to Cloud Search and not present in other search engines: the ability to give a property a user-friendly name (i.e. an operator name). This later allows for filtering of that property based on the operator in a user query that has been parsed with Natural Language Understanding (NLU) algorithms. For example, when a user searches for “tickets with a priority of 1,” priority is an operator name which triggers a filter on the priority field.
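To make the schema and operator-name ideas concrete, here is a minimal sketch of what a schema for a "ticket" object could look like, written as a Python dictionary. The field names (objectDefinitions, propertyDefinitions, isReturnable, operatorOptions, and so on) follow our reading of the public Cloud Search schema reference and should be verified against the current documentation; the "ticket" object, its properties, and the operator_names helper are hypothetical and exist only for illustration.

```python
# Illustrative schema: one object type ("ticket") with two properties.
# Object and property names are made up for this example.
schema = {
    "objectDefinitions": [
        {
            "name": "ticket",
            "propertyDefinitions": [
                {
                    "name": "priority",
                    "isReturnable": True,
                    "isFacetable": True,
                    "isSortable": True,
                    # The operator name is the user-friendly handle that a
                    # query can use to filter on this property.
                    "integerPropertyOptions": {
                        "operatorOptions": {"operatorName": "priority"}
                    },
                },
                {
                    "name": "summary",
                    "isReturnable": True,
                    "textPropertyOptions": {
                        "operatorOptions": {"operatorName": "summary"}
                    },
                },
            ],
        }
    ]
}


def operator_names(schema):
    """Map each operator name to the (object, property) pair it filters on."""
    mapping = {}
    for obj in schema["objectDefinitions"]:
        for prop in obj["propertyDefinitions"]:
            for key, options in prop.items():
                # Type-specific option blocks are named "<type>PropertyOptions".
                if key.endswith("PropertyOptions"):
                    name = options.get("operatorOptions", {}).get("operatorName")
                    if name:
                        mapping[name] = (obj["name"], prop["name"])
    return mapping
```

Calling operator_names(schema) lists which operators a query such as "tickets with a priority of 1" could resolve to, which is a handy sanity check before registering a schema.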
b. REST content indexing API
At the lowest level, Cloud Search provides a JSON-based REST API to index data into its data sources. An indexing request includes key elements such as:
OAuth 2.0 is the authentication mechanism required for using the Cloud Search APIs. Google provides a Java client library that simplifies the use of the REST APIs. In addition, there are mechanisms for automatically generating client libraries for other languages.
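As a rough illustration of what an indexing request carries, the sketch below builds a JSON body for a single item without sending it. The field names (item.name, acl.readers, gsuitePrincipal, inlineContent, itemType, mode) reflect our understanding of the public REST reference and are assumptions to be checked against the current documentation; the build_index_request helper, the data source id, and the sample values are hypothetical. Authentication (an OAuth 2.0 bearer token) and the HTTP call itself are deliberately left out.

```python
import base64
import json


def build_index_request(datasource_id, item_id, title, url, text,
                        reader_email, version):
    """Build the JSON body for indexing one item (nothing is sent)."""

    def b64(value):
        # Binary fields travel as base64-encoded strings in JSON.
        return base64.b64encode(value.encode("utf-8")).decode("ascii")

    return {
        "item": {
            # Items are addressed by data source id and item id.
            "name": f"datasources/{datasource_id}/items/{item_id}",
            # ACL: who is allowed to see this item in search results.
            "acl": {
                "readers": [
                    {"gsuitePrincipal": {"gsuiteUserEmail": reader_email}}
                ]
            },
            "metadata": {"title": title, "sourceRepositoryUrl": url},
            "content": {"inlineContent": b64(text), "contentFormat": "TEXT"},
            "version": b64(version),
            "itemType": "CONTENT_ITEM",
        },
        "mode": "SYNCHRONOUS",
    }


request_body = build_index_request(
    datasource_id="example-datasource",
    item_id="ticket-42",
    title="Printer is offline",
    url="https://tickets.example.com/42",
    text="The third-floor printer has been offline since Monday.",
    reader_email="uid@companydomain.com",
    version="1",
)
request_json = json.dumps(request_body, indent=2)
```

In a real connector this body would be posted to the item's index endpoint with an authorized HTTP client; the Java client library mentioned above wraps that step.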
c. Connectors and connector SDK (Software Development Kit)
The out-of-the-box connectors included with Cloud Search are CSV, Web Crawlers, Windows File Systems, Relational Databases, Microsoft SharePoint 2013/2016, and SharePoint Online.
“Well, what if I need to ingest content sources other than those covered by Cloud Search’s out-of-the-box connectors?” you may ask. There are a couple of options:
- The ability to listen for change notifications – push style incremental indexing for content repositories that support this mechanism.
- Graph traversal – the ability to perform indexing through a recursive crawl starting at the root node, suitable for hierarchical repositories.
- The ability to leverage the Cloud Search Indexing Queue – a mechanism that allows for managing state and priorities for indexed items. This is especially useful for incremental indexing.
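The graph traversal strategy can be sketched in a few lines. This is a generic, in-memory illustration of the idea and not the connector SDK's actual interface: the crawl function, the list_children callback, and the sample repository are all hypothetical. It uses an explicit queue in place of recursion, which loosely mirrors how a queue of pending items lets a crawl be paused and resumed.

```python
from collections import deque


def crawl(root, list_children):
    """Visit every node reachable from root once, breadth-first."""
    queue = deque([root])   # nodes discovered but not yet processed
    seen = {root}           # guards against visiting a node twice
    visited = []
    while queue:
        node = queue.popleft()
        visited.append(node)  # a real connector would index the node here
        for child in list_children(node):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return visited


# Hypothetical hierarchical repository: folder -> children.
repository = {
    "root": ["folder-a", "folder-b"],
    "folder-a": ["doc-1"],
    "folder-b": ["doc-1", "doc-2"],  # doc-1 is linked from two folders
}

crawl_order = crawl("root", lambda node: repository.get(node, []))
```

Each node is processed only once even when it is reachable through several parents, which matters for repositories that allow shared or linked documents.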
d. Hierarchical Structures: ACLs and Documents
Cloud Search provides unparalleled support for expressive and powerful modeling techniques of search indexes that hold content from very complex repository structures:
2. USERS AND GROUPS INGESTION
Cloud Search supports security trimming through early binding. Early binding requires that the search engine determines all the groups to which a user belongs (once the user is authenticated) through a process called group expansion. This allows for items to be properly filtered in the search results by comparing the item’s ACLs to the user’s id and its groups. The GSA used to rely on external systems to perform group expansion. Cloud Search brings group expansion closer to its core: it interacts with other Google identity services to authenticate a user and determine the groups. As such, user identities and groups (generically called principals) have to be made available (indexed and synced) within these identity services.
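The early-binding flow described above can be illustrated with a small, self-contained sketch. It is a conceptual model only and says nothing about how Cloud Search implements trimming internally; the function names, group names, and sample data are hypothetical.

```python
def expand_groups(user, memberships):
    """Return every group the user belongs to, directly or through nesting.

    memberships maps a group id to the set of its direct members,
    which may be user ids or other group ids.
    """
    expanded = set()
    frontier = [user]
    while frontier:
        member = frontier.pop()
        for group, members in memberships.items():
            if member in members and group not in expanded:
                expanded.add(group)
                frontier.append(group)  # a group can itself be a member
    return expanded


def visible_items(items, user, memberships):
    """Keep only the items whose ACL names the user or one of their groups."""
    principals = expand_groups(user, memberships) | {user}
    return [item["id"] for item in items if principals & item["readers"]]


memberships = {
    "engineering": {"alice@example.com"},
    "staff": {"engineering", "bob@example.com"},  # nested group
}
items = [
    {"id": "doc1", "readers": {"staff"}},
    {"id": "doc2", "readers": {"engineering"}},
    {"id": "doc3", "readers": {"carol@example.com"}},
]
```

Here alice@example.com sees doc1 and doc2 because her membership in engineering also places her in staff, while bob@example.com sees only doc1.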
Below are the core architecture components that support the ingestion of principals.
a. Cloud Identity
Cloud Identity is a Google directory service which hosts the account information for all users that have access to Google services, including Cloud Search. A user’s unique identifier is expressed as an email: uid@companydomain.com – note that the user is not required to have a G Suite license in order to obtain a unique identifier. This is supported by Google Cloud Identity’s free edition. A user can have multiple aliases – a user’s alias is expressed in the context of an identity source. The Google Admin SDK/REST API provides mechanisms for maintaining users and their aliases.
b. Identity source
An identity source is a collection of principals and has a unique identifier. There are two mechanisms through which principals (as part of an identity source) can be expressed:
These groups don’t need to have emails associated with them – they are considered as non-e-mailable groups or security groups (in the Active Directory terminology). The Cloud Identity service provides an SDK/REST API to maintain these non-e-mailable groups.
c. G Suite groups
For customers that have G Suite licenses, it is possible to set up groups of users who can receive emails as a group – the unique identifier for these groups is an email. The equivalent Active Directory term for these groups is Distribution List (DL). The Google Admin SDK/REST API provides mechanisms for maintaining these e-mailable groups.
d. Expressions of principals in ACLs
Principals inside ACLs can be expressed as:
This opens up the possibility for straightforward ingestion of user ids and groups that are local to a content repository (i.e. not synced from an LDAP store) into Cloud Identity. For example, we can have a local group called admin in SharePoint and another group called admin in Confluence – they can both be ingested into Cloud Identity with their original ids. There is no name collision because of namespacing – each group id would be associated with a different identity source.
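The namespacing idea can be shown with a tiny model in which a group is identified by the pair (identity source, local id) and not by the local id alone. The Principal type and the identity source names below are hypothetical and are not the identifiers Cloud Identity actually uses.

```python
from typing import NamedTuple


class Principal(NamedTuple):
    """A group or user id qualified by the identity source it comes from."""
    identity_source: str
    local_id: str


sharepoint_admin = Principal("sharepoint-identity-source", "admin")
confluence_admin = Principal("confluence-identity-source", "admin")

# Both groups keep their original local id, yet they stay distinct keys.
group_members = {
    sharepoint_admin: {"alice@example.com"},
    confluence_admin: {"bob@example.com"},
}
```

Because the identity source is part of the key, granting access to the SharePoint admin group does not accidentally expose content to members of the Confluence group with the same name.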
e. Mechanisms for ingesting identities
Aside from the SDKs/APIs mentioned above, Cloud Search also provides an identity connector SDK. Conceptually, it is similar to the connector framework SDK: it simplifies the development of connectors that sync users and groups from different content repositories to Google’s identity services. The current version of Google Cloud Directory Sync supports syncing users and e-mailable groups from LDAP stores. A future version will add support for non-e-mailable groups.
3. QUERY AND USER INTERFACES
a. Query API
A JSON-based REST API is provided for searching data with the following features:
b. User interfaces (UIs)
Custom search user interfaces can be built on top of the query/suggest APIs discussed above. Cloud Search also provides an out-of-the-box HTML-based embeddable search widget (ESW) that can be deployed either as a standalone UI or embedded in a larger application (for example, an existing portal). The ESW provides some limited customization of displayed elements and of the look and feel through CSS styling.
c. Listing API
The results of a query do not include any sensitive data, such as ACL information, nor do they provide access to the entire content of an item. For debugging and troubleshooting, Cloud Search provides a listing API which gives access to the entire structure of an item inside Cloud Search: ACLs, full content, version, item’s status, etc.
4. SEARCH RELEVANCE
In traditional Google style, there aren’t too many knobs one can use to tune the engine’s relevancy – some capabilities are still present though:
There is also no query explain functionality for inspecting a query’s execution path, as Lucene-based search engines provide. This is understandable: relevance algorithms are one of the main “secret sauces” that brought Google search to where it is today.
With that said, based on some of the ingredients we supply to Google’s relevance machine, we can expect Cloud Search to provide best-of-breed search relevance in the enterprise space, with a special emphasis on user personalization. Here are some examples:
- All query, facet, filter, sort, pagination requests as a result of using the query API
- URLs for search results are wrapped in a redirect mechanism, which allows clicks on every search result to be captured
- The index API provides a mechanism to supply document interactions that occur outside of the search application (records of time and user are created when a document is viewed/edited directly through the content repository’s UI)
Opportunities for Improvement
As with any first release of a major service, there are some rough edges that can be improved over time. For instance:
However, the core and critical features necessary for deploying powerful enterprise search solutions are present in Cloud Search’s first general availability release. Also, future Cloud Search versions will likely address many of the perceived shortcomings listed above.
As reflected throughout this blog post, the user is at the center of the search experience that Cloud Search aims to provide: knowing the user and their context is essential. Some popular Cloud Search use cases we envision are:
On the other hand, a typical Cloud Search deployment may not be a perfect match for some use cases:
Clients frequently ask us: “Why can’t I get the google.com search experience in my enterprise?” With the release of Google Cloud Search for Third-Party Content, we are now closer to achieving that goal. User personalization and the ability to apply cloud-based machine learning models to Natural Language Understanding for query interpretation and improved search relevance algorithms are the key factors that take Google’s enterprise search experience to the next level. At Accenture, we bring strategy, deployment expertise, and technology assets to help companies minimize risks often associated with a new search implementation and boost Cloud Search’s ROI.
Technology assets to support Google Cloud Search implementation
The diagram below illustrates how these technology components work together.
Leveraging our experience with Google search products and technology assets, we can help build fully customized search solutions with Cloud Search, from content acquisition, processing, and enrichment to UI development.