In 2002, Google entered the enterprise search space with the introduction of the Google Search Appliance (GSA) - a rack-mounted device that provided document indexing and search functionality. Over time, the GSA gained popularity thanks to a number of powerful features:
- Simple setup of a web crawler to index content from an enterprise’s internal websites
- Out-of-the-box user interface (UI) to quickly give users access to search
- Connectors to index richer metadata structures and ACLs (Access Control Lists) from a variety of content management systems
- Integration with an organization’s authentication systems, using the indexed ACLs to display only the search results that users are allowed to view
As newer technology trends and features were introduced to the market, competitors of the GSA’s search solution started to attract more users. Here are some trends we have spotted over the past few years:
- As the migration to the cloud took hold in enterprise IT, the GSA’s deployment inside an enterprise data center and its lack of auto-scaling capabilities made it less flexible than other search solutions.
- Other proprietary and open source search solutions continued to evolve their indexing and query APIs at a pace that made the GSA feel rigid.
- Although the GSA’s search relevancy models were some of the best, the inability to feed substantially more data to help improve these models put a cap on what the GSA could provide in terms of search relevancy.
From the Appliance to the Cloud
Fast-forward to 2016 when Google announced the GSA’s end-of-life (the last GSA license renewals end in the spring of 2019) and its replacement with a new cloud offering: Google Cloud Search.
Cloud Search was initially rolled out as a G Suite service with the ability to search content from a number of Google services: Gmail, Drive, Calendar, Sites, People, and Groups. The next major Cloud Search release, announced at Google Cloud Next 2018, is a standalone service that adds support for indexing and searching third-party enterprise content, such as:
- Internal websites and portals
- Content management systems
- File systems
- Relational databases
- Content hosted in enterprise applications
While Cloud Search seamlessly integrates third-party and G Suite content, it’s important to note that enterprises will not need to have a G Suite license in order to use Cloud Search – Cloud Search customers can choose to index only third-party content and make it available within their enterprise search solutions. See Cloud Search documentation.
A first look at Google Cloud Search – Key features
Over the last eight months, Google has offered early access to select partners with significant enterprise search experience and to customers willing to try Cloud Search’s early alpha/beta versions with third-party connectivity. Accenture has been working with Google on the roll-out. We had a front-row seat in learning about the Cloud Search features, performing initial customer deployments, providing feedback, and influencing some of the future evolution of Cloud Search.
Below are our takeaways and analysis of some of the main Cloud Search features. They can serve as initial guidelines for companies looking to leverage Cloud Search, whether it is to replace the GSA, to get started on their first search initiative, or to improve their current search experience.
1. CONTENT INDEXING
a. Data organization
Content is organized inside Cloud Search in data sources (collections of items that share the same schema). A schema is a blueprint of the type of objects that will be indexed inside a data source. It allows a user to define an object’s properties that hold custom metadata and their types (e.g. text, date, timestamp, integer, double, Boolean, etc.). The properties’ definitions include other characteristics that indicate if the field is returnable, facetable, sortable, etc.
The most noteworthy among a property’s characteristics is a capability unique to Cloud Search and not present in other search engines: the ability to give a property a user-friendly name (i.e. an operator name). This later allows for filtering of that property based on the operator in a user query that has been parsed with Natural Language Understanding (NLU) algorithms. For example, when a user searches for “tickets with a priority of 1,” priority is an operator name which triggers a filter on the priority field.
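To make the schema idea more concrete, here is a minimal sketch of what a schema for a hypothetical "ticket" object could look like, written as a Python dictionary that serializes to JSON. The field names (`objectDefinitions`, `propertyDefinitions`, `operatorOptions`, and so on) reflect our understanding of the Cloud Search schema format and should be treated as assumptions to verify against the official reference. The `ticket` object, its properties, and the `operator_names` helper are invented for illustration.

```python
import json

# Illustrative schema for a hypothetical "ticket" object type.
# Field names are assumptions about the Cloud Search schema format;
# check them against the official documentation before use.
TICKET_SCHEMA = {
    "objectDefinitions": [
        {
            "name": "ticket",
            "propertyDefinitions": [
                {
                    "name": "priority",
                    "isReturnable": True,
                    "isFacetable": True,
                    "isSortable": True,
                    "integerPropertyOptions": {
                        # User-friendly operator name used in queries
                        "operatorOptions": {"operatorName": "priority"}
                    },
                },
                {
                    "name": "ticketSummary",
                    "isReturnable": True,
                    "textPropertyOptions": {
                        "operatorOptions": {"operatorName": "summary"}
                    },
                },
            ],
        }
    ]
}


def operator_names(schema):
    """Map each operator name to the schema property it filters on."""
    mapping = {}
    for obj in schema["objectDefinitions"]:
        for prop in obj["propertyDefinitions"]:
            for key, options in prop.items():
                if key.endswith("PropertyOptions"):
                    op = options.get("operatorOptions", {}).get("operatorName")
                    if op:
                        mapping[op] = prop["name"]
    return mapping


if __name__ == "__main__":
    print(json.dumps(operator_names(TICKET_SCHEMA), indent=2))
```

The helper simply shows the relationship described above: an operator name such as `priority` or `summary` is an alias that a query can use to target a specific property.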
b. REST content indexing API
At the lowest level, Cloud Search provides a JSON-based REST API to index data into its data sources. An indexing request includes key elements such as:
- Common metadata: fields present across all items, regardless of the data sources they reside in or the schema associated with them - the item’s source URL, mime type, content language, creation/update time, etc. A document content’s language can also be auto-detected. Cloud Search provides support for over 100 languages. If you like the quality of Google Translate services, it is reasonable to also expect the same type of high-quality language support in Cloud Search.
- Structured data: custom fields – the metadata expressed in a data source’s schema
- Content: can be expressed as full text or in raw format (Cloud Search supports automatic text extraction from binary content for most popular document formats: Office documents, PDFs, etc.).
- Access Control Lists (ACLs): both allow and deny permissions are supported.
- Unique document identifier and version
OAuth 2.0 is the authentication mechanism required for using the Cloud Search APIs. Google provides a Java client library that simplifies the use of the REST APIs. In addition, there are mechanisms for the automatic generation of client libraries for other languages.
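As a rough illustration of how the elements of an indexing request fit together, the sketch below assembles a payload for a single item in Python. It only builds the JSON body; sending it requires an OAuth 2.0 access token and is not shown. The field names (`metadata`, `structuredData`, `inlineContent`, `deniedReaders`, `gsuitePrincipal`, etc.) are our best understanding of the API's request shape and are assumptions to confirm against the reference documentation. The data source ID, item ID, and email addresses are placeholders.

```python
import base64
import json


def b64(text):
    """Base64-encode a UTF-8 string (binary fields travel as base64 in JSON)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_index_request(datasource_id, item_id, title, url, body_text,
                        priority, readers, denied_readers, version):
    """Assemble an illustrative indexing payload for one item."""
    def user(email):
        return {"gsuitePrincipal": {"gsuiteUserEmail": email}}

    return {
        "item": {
            # Unique document identifier and version
            "name": f"datasources/{datasource_id}/items/{item_id}",
            "version": b64(version),
            "itemType": "CONTENT_ITEM",
            # Common metadata, present regardless of schema
            "metadata": {
                "title": title,
                "sourceRepositoryUrl": url,
                "mimeType": "text/plain",
                "contentLanguage": "en",
                "objectType": "ticket",
            },
            # Structured data: custom fields defined in the schema
            "structuredData": {
                "object": {
                    "properties": [{"name": "priority", "integerValue": priority}]
                }
            },
            # Content, here supplied as plain text
            "content": {"inlineContent": b64(body_text), "contentFormat": "TEXT"},
            # ACLs: both allowed and denied principals
            "acl": {
                "readers": [user(e) for e in readers],
                "deniedReaders": [user(e) for e in denied_readers],
            },
        }
    }


if __name__ == "__main__":
    request = build_index_request(
        "ds1", "T-1", "Printer is offline", "https://tickets.example.com/T-1",
        "The printer on floor 3 is offline.", 1,
        ["alice@example.com"], ["bob@example.com"], "v1",
    )
    print(json.dumps(request, indent=2))
```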
c. Connectors and connector SDK (Software Development Kit)
The out-of-the-box connectors included with Cloud Search are CSV, Web Crawlers, Windows File Systems, Relational Databases, Microsoft SharePoint 2013/2016, and SharePoint Online.
“Well, what if I need to ingest content sources other than those provided by Cloud Search’s out-of-the-box connectors?” you may ask. There are a couple of options:
- Option 1: A rich collection of connectors from the partner ecosystem: Google Cloud Search has a vibrant ecosystem of over 50 partners worldwide who have quickly developed more than 80 connectors for over 60 enterprise content sources. Accenture, as a long-time Google search partner, has adapted our Aspire framework to provide a range of search engine agnostic connectors as well as a Cloud Search publisher that can ingest and publish content from multiple unstructured and structured enterprise sources into Cloud Search. This option can help lower risks and accelerate your development-to-launch time.
- Option 2: Google provides a Java-based connector framework that simplifies the development of connectors for other content and identity sources. The Cloud Search connector framework provides several strong features:
  - The ability to listen for change notifications – push-style incremental indexing for content repositories that support this mechanism.
  - Graph traversal – the ability to perform indexing through a recursive crawl starting at the root node, suitable for hierarchical repositories.
  - The ability to leverage the Cloud Search Indexing Queue – a mechanism that allows for managing state and priorities for indexed items. This is especially useful for incremental indexing.
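The graph traversal strategy can be pictured with a small Python sketch. This is not the Java connector SDK; it is a language-neutral illustration of crawling a hierarchical repository from its root, using an explicit queue in place of recursion. The `REPOSITORY` dictionary and the `traverse` function are hypothetical stand-ins for a real content source.

```python
from collections import deque

# Hypothetical repository: each container maps to its direct children.
REPOSITORY = {
    "root": ["docs", "readme.txt"],
    "docs": ["docs/a.pdf", "docs/archive"],
    "docs/archive": ["docs/archive/old.doc"],
}


def traverse(root, children_of):
    """Breadth-first crawl yielding (item_id, parent_id, is_container)."""
    queue = deque([(root, None)])
    seen = set()
    while queue:
        item, parent = queue.popleft()
        if item in seen:
            continue  # guard against cycles or duplicate links
        seen.add(item)
        yield item, parent, item in children_of
        for child in children_of.get(item, []):
            queue.append((child, item))


if __name__ == "__main__":
    for item_id, parent_id, is_container in traverse("root", REPOSITORY):
        kind = "container" if is_container else "content"
        print(f"{item_id} (parent={parent_id}, {kind})")
```

In a real connector, each yielded item would be turned into an indexing request, with the parent ID carried along so the hierarchy is preserved in the index.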
d. Hierarchical Structures: ACLs and Documents
Cloud Search provides unparalleled support for expressive modeling of search indexes that hold content from very complex repository structures:
- For hierarchical content repositories (the simplest example: file systems – with folders and files), one can index items as either Container items (folders) or Content items (files). This allows the expression of the entire object hierarchy in search. Here’s an example of a powerful capability supported by this feature: when a folder is deleted in the file system, in other search engines, one has to send delete requests for the folder and each item inside the folder. But in Cloud Search, a delete for just the folder is sufficient (Cloud Search will automatically delete all sub-items inside that folder). The ability to filter items belonging to the same folder is another future capability enabled by this feature.
- Three types of ACL inheritance are supported: Parent Override (a parent item’s permissions override the item’s ACLs), Child Override, or Both (to be able to view an item, a user would need view access to both the parent and the child items).
- The hierarchy used for ACL inheritance can be different from the document hierarchy. For instance, Document A is a child of Document B but inherits the ACLs from Document C.
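The three ACL inheritance types can be illustrated with a simplified model in Python. This is a sketch of the idea rather than Cloud Search's actual evaluation logic: we assume each ACL either allows a user, denies a user, or says nothing about them, and all names (`Inheritance`, `decide`, `can_view`) are hypothetical. Note that the "parent" here is the item the ACL is inherited from, which, as in the Document A/B/C example, need not be the containing document.

```python
from enum import Enum


class Inheritance(Enum):
    PARENT_OVERRIDE = "parent_override"
    CHILD_OVERRIDE = "child_override"
    BOTH = "both"


def decide(acl, principals):
    """Return True (allowed), False (denied), or None (ACL is silent)."""
    if principals & acl.get("denied", set()):
        return False
    if principals & acl.get("readers", set()):
        return True
    return None


def can_view(child_acl, parent_acl, mode, principals):
    """Combine the child's ACL with the ACL it inherits from."""
    child = decide(child_acl, principals)
    parent = decide(parent_acl, principals)
    if mode is Inheritance.BOTH:
        # The user needs view access to both items.
        return child is True and parent is True
    if mode is Inheritance.PARENT_OVERRIDE:
        first, second = parent, child
    else:
        first, second = child, parent
    # The overriding ACL wins whenever it has an opinion.
    decision = first if first is not None else second
    return decision is True


if __name__ == "__main__":
    doc_c_acl = {"readers": {"alice@example.com"}}   # ACL parent of Document A
    doc_a_acl = {"denied": {"alice@example.com"}}
    for mode in Inheritance:
        print(mode.name, can_view(doc_a_acl, doc_c_acl, mode, {"alice@example.com"}))
```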
2. USERS AND GROUPS INGESTION
Cloud Search supports security trimming through early binding. Early binding requires that the search engine determines all the groups to which a user belongs (once the user is authenticated) through a process called group expansion. This allows for items to be properly filtered in the search results by comparing the item’s ACLs to the user’s id and its groups. The GSA used to rely on external systems to perform group expansion. Cloud Search brings group expansion closer to its core: it interacts with other Google identity services to authenticate a user and determine the groups. As such, user identities and groups (generically called principals) have to be made available (indexed and synced) within these identity services.
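Early binding can be sketched in a few lines of Python. The example below is a conceptual illustration, not Cloud Search code: it expands a user's group memberships (including nested groups) and then filters results by comparing each item's ACL with the expanded set of principals. The `MEMBERSHIPS` and `RESULTS` data and both function names are made up for the example.

```python
# Hypothetical directory: member -> groups it belongs to directly.
MEMBERSHIPS = {
    "alice@example.com": ["engineering"],
    "engineering": ["staff"],
    "bob@example.com": ["sales"],
}

# Hypothetical indexed items with their ACLs.
RESULTS = [
    {"id": "design-doc", "readers": {"engineering"}},
    {"id": "handbook", "readers": {"staff"}},
    {"id": "pipeline", "readers": {"sales"}},
    {"id": "review", "readers": {"staff"}, "denied": {"alice@example.com"}},
]


def expand_groups(user, memberships):
    """Return every group the user belongs to, directly or through nesting."""
    expanded = set()
    stack = [user]
    while stack:
        member = stack.pop()
        for group in memberships.get(member, ()):
            if group not in expanded:
                expanded.add(group)
                stack.append(group)
    return expanded


def trim(results, user, memberships):
    """Keep only the items the user is allowed to view."""
    principals = {user} | expand_groups(user, memberships)
    visible = []
    for item in results:
        allowed = bool(principals & item["readers"])
        denied = bool(principals & item.get("denied", set()))
        if allowed and not denied:
            visible.append(item["id"])
    return visible


if __name__ == "__main__":
    print(trim(RESULTS, "alice@example.com", MEMBERSHIPS))
```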
Below are the core architecture components that support the ingestion of principals.
a. Cloud Identity
Cloud Identity is a Google directory service which hosts the account information for all users that have access to Google services, including Cloud Search. A user’s unique identifier is expressed as an email: firstname.lastname@example.org – note that the user is not required to have a G Suite license in order to obtain a unique identifier. This is supported by Google Cloud Identity’s free edition. A user can have multiple aliases – a user’s alias is expressed in the context of an identity source. The Google Admin SDK/REST API provides mechanisms for maintaining users and their aliases.
b. Identity source
An identity source is a collection of principals and has a unique identifier. There are two mechanisms through which principals (as part of an identity source) can be expressed:
- User aliases (for individuals)
- Group ids (for groups)
These groups don’t need to have emails associated with them – they are considered non-e-mailable groups, or security groups in Active Directory terminology. The Cloud Identity service provides an SDK/REST API to maintain these non-e-mailable groups.
c. G Suite groups
For customers that have G Suite licenses, there is an ability to set up groups of users who can receive emails as a group – the unique identifier for these groups is an email. The equivalent Active Directory term for these groups is Distribution Lists (DLs). The Google Admin SDK/REST API provides mechanisms for maintaining these e-mailable groups.
d. Expressions of principals in ACLs
Principals inside ACLs can be expressed as:
- A user’s unique identifier (an email)
- A user alias (identity source id + alias id)
- An e-mailable group (an email)
- A non-e-mailable group (identity source id + group id)
This opens up the possibility for straightforward ingestion of user ids and groups that are local to a content repository (i.e. not synced from an LDAP store) into Cloud Identity. For example, we can have a local group called admin in SharePoint and another group called admin in Confluence – they can both be ingested into Cloud Identity with their original ids. There is no name collision because of namespacing – each group id would be associated with a different identity source.
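The namespacing idea can be shown with a short Python sketch. The `GroupPrincipal` class and the identity source IDs below are hypothetical; the point is only that a group is identified by the pair (identity source id, group id), so two repositories can each keep a local group called admin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupPrincipal:
    """A non-e-mailable group, scoped to the identity source it came from."""
    identity_source_id: str
    group_id: str


# Both repositories have a local group named "admin".
sharepoint_admin = GroupPrincipal("sharepoint-identities", "admin")
confluence_admin = GroupPrincipal("confluence-identities", "admin")

# Members are tracked per namespaced group, so nothing is overwritten.
members = {
    sharepoint_admin: {"alice@example.com"},
    confluence_admin: {"bob@example.com"},
}

if __name__ == "__main__":
    print(sharepoint_admin == confluence_admin)  # same group id, different source
    print(len(members))
```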
e. Mechanisms for ingesting identities
Aside from the SDKs/APIs mentioned above, Cloud Search also provides an identity connector SDK. Conceptually, it is similar to the content connector SDK: it simplifies the development of connectors that sync users and groups from different content repositories to Google’s identity services. The current version of Google Cloud Directory Sync provides support for syncing users and e-mailable groups from LDAP stores. A future version will add support for non-e-mailable groups.
3. QUERY AND USER INTERFACES
a. Query API
A JSON-based REST API is provided for searching data with the following features:
- Faceting and filtering
- Sorting and pagination
- Ability to search across multiple data sources - this includes predefined (first-party) data sources: Gmail, Google Drive, Sites, etc.
- Natural language interpretation of user queries – another Cloud Search differentiator compared to other search engines
- Snippet highlighting
- Multi-language support
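A query request combining several of these features might be assembled as in the Python sketch below. It only builds the JSON body; the HTTP call and OAuth 2.0 token handling are omitted. The field names (`dataSourceRestrictions`, `facetOptions`, `sortOptions`, etc.) are our understanding of the request format and are assumptions to verify against the API reference. The data source ID and operator names are placeholders.

```python
import json


def build_search_request(query, page_size=10, start=0, datasources=(),
                         facet_operator=None, sort_operator=None,
                         descending=True):
    """Assemble an illustrative search request body."""
    body = {
        "query": query,          # free text; may contain operators
        "pageSize": page_size,   # pagination: results per page
        "start": start,          # pagination: offset of the first result
    }
    if datasources:
        # Restrict the search to specific data sources.
        body["dataSourceRestrictions"] = [
            {"source": {"name": f"datasources/{ds}"}} for ds in datasources
        ]
    if facet_operator:
        body["facetOptions"] = [{"operatorName": facet_operator}]
    if sort_operator:
        body["sortOptions"] = {
            "operatorName": sort_operator,
            "sortOrder": "DESCENDING" if descending else "ASCENDING",
        }
    return body


if __name__ == "__main__":
    request = build_search_request(
        "tickets with a priority of 1",
        page_size=20,
        datasources=["ds1"],
        facet_operator="priority",
        sort_operator="priority",
    )
    print(json.dumps(request, indent=2))
```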
b. User interfaces (UIs)
Custom search user interfaces can be built on top of the query/suggest APIs discussed above. Cloud Search also provides an out-of-the-box HTML-based embeddable search widget (ESW) that can be deployed either as a standalone UI or embedded in a larger application (for example, an existing portal). The ESW provides some limited customization of displayed elements and of the look and feel through CSS styling.
c. Listing API
The results of a query do not include any sensitive data, such as ACL information, nor do they provide access to the entire content of an item. For debugging and troubleshooting, Cloud Search provides a listing API which gives access to the entire structure of an item inside Cloud Search: ACLs, full content, version, item’s status, etc.
4. SEARCH RELEVANCE
In traditional Google style, there aren’t too many knobs one can use to tune the engine’s relevancy – some capabilities are still present though:
- A search quality indicator (a number) can be provided for each indexed item to boost or demote that item’s position in the search results
- A setting to alter relevancy based on an item’s temporal freshness in the index
- The ability to boost content based on the data source provenance
- The ability to switch user personalization on and off
There is also no query explain functionality to see a query’s execution path, like the one Lucene-based search engines provide. This is understandable – relevance algorithms are one of the main “secret sauces” that brought Google search to where it is today.
With that said, based on some of the ingredients we supply to Google’s relevance machine, we can expect Cloud Search to provide best-of-breed search relevance quality in the enterprise space, with a special emphasis on user personalization. Here are some examples:
- There is a variety of user interactions that can be captured:
  - All query, facet, filter, sort, and pagination requests made through the query API
  - URLs for search results are wrapped in a redirect mechanism which allows the clicks on every single search result to be captured
  - The index API provides a mechanism to supply document interactions that occur outside of the search application (records of time and user are created when a document is viewed/edited directly through the content repository’s UI)
- With the entire user directory reflected inside Google’s services, Cloud Search can better understand the user’s context over time (e.g. organization chart, location, groups of people that might conduct similar searches, new hires, etc.)
- A better understanding of the relationships between documents in the indexed corpora
Opportunities for Improvement
As with any first release of a major service, there are some rough edges that can be improved over time. For instance:
- The Listing API provides few filtering capabilities to narrow the scope of the results to be retrieved.
- There isn’t yet a simple mechanism to tie results in a query request to results in a listing request.
- It’s difficult to filter common properties across multiple data sources.
- The administration UIs and operational statistics are not yet up to par with what a typical GSA administrator was used to.
- There's a lack of more powerful/sophisticated faceting and aggregation capabilities.
However, the core and critical features necessary for deploying powerful enterprise search solutions are present in Cloud Search’s first general availability release. Also, future Cloud Search versions will likely address many of the perceived shortcomings listed above.
Google Cloud Search use cases – What to consider?
As reflected throughout this blog post, the user is at the center of what the Cloud Search search experience aims to provide: knowing the user and his/her context is essential. Some popular Cloud Search use cases we envision are:
- Intranet portal search
- Corporate-wide search
- Question and answer systems
On the other hand, a typical Cloud Search deployment may not be a perfect match for some use cases:
- Public content: Many of our GSA customers have deployed the appliance to support search on their public-facing websites (i.e. search over content already available on the Internet). The Cloud Search query APIs require authentication; thus, while doable, it is not currently straightforward to use Cloud Search for search applications that only provide search over public content. Something else to consider: google.com already does a good job of indexing public content.
- Log data: Another example of content where user identification is not essential is log data. The structure of log records is typically much simpler and thus some of Cloud Search’s powerful features are not applicable. One of the dimensions that factor into Cloud Search pricing is the number of items (i.e. documents or folders). So, a comparison of cost versus value should be considered before using Cloud Search for very large amounts of log data.
Powering a Google-like enterprise search experience with Google Cloud Search
We frequently get asked by clients: “Why can’t I get the google.com search experience in my enterprise?” With the release of Google Cloud Search for Third-Party Content, we are now closer to achieving that goal. User personalization and the ability to apply cloud-based machine learning models to Natural Language Understanding for query interpretation and improved search relevance algorithms are the key factors that take Google’s enterprise search experience to the next level. At Accenture, we bring strategy, deployment expertise, and technology assets to help companies minimize risks often associated with a new search implementation and boost Cloud Search’s ROI.
Technology assets to support Google Cloud Search implementation
- We have built a range of search engine independent connectors that facilitate the acquisition of unstructured and structured content from many sources.
- Our publisher framework features most of the common functionality required for publishing content to various search engines. Accenture’s Aspire Google Cloud Search publisher receives content from Aspire connectors and uses the Java Client library to index that content into Cloud Search.
- Accenture’s Aspire Content Processing is an ETL framework designed specifically for unstructured and semi-structured data. It provides optimal functionality, a wide range of ready-made processing components, a Hadoop implementation, and distributed processing capabilities.
The diagram below illustrates how these technology components work together.
Figure: Accenture's Aspire Content Processing and connectors workflow for Google Cloud Search
Leveraging our experience with Google search products and technology assets, we can help build fully customized search solutions with Cloud Search, from content acquisition, processing, and enrichment to UI development.