Semantic query parsing with Lucidworks Fusion: Key to understanding your search queries
Sep 23, 2021 • 12 min read
Semantic query parsing, described in “Semantic query parsing blueprint”, is a powerful technique for improving the relevance of search results. It has proven especially successful when a search solution deals with semi-structured data, such as an ecommerce product catalog, customer profiles or real estate listings.
In our previous posts, we described the structure of the core engine of the solution, yet a real-life implementation involves a lot of technical work related to managing infrastructure, ingesting the data, exposing APIs and organizing a query pipeline.
In this blog post, we will describe the architecture of an end-to-end semantic query parsing solution based on the Lucidworks Fusion search platform. The Fusion search engine takes care of many mundane search implementation concerns, allowing customers to focus on achieving the best relevance of results.
What is Lucidworks Fusion?
Lucidworks Fusion is a full-fledged search and AI platform based on Apache Solr and Apache Spark. It has a cloud-native microservices architecture orchestrated by Kubernetes. It can be deployed on a Kubernetes engine, either in the cloud or on premises, and is also available as a PaaS. Fusion offers a fundamental integration framework for a search application, as well as a large number of services, components, connectors, and APIs which simplify and accelerate development.
Focus on features, not infrastructure
When developing search solutions, there is always a choice between a custom-built and a platform-based approach. While the custom approach provides ultimate flexibility, the platform-based solution shortens time to market and is typically more efficient.
We can illustrate this with the following example. Let’s consider a common search flow diagram (a very simplified version) and evaluate where exactly Lucidworks Fusion capabilities accelerate the solution compared to a custom-built implementation:
A search system usually consists of index-time and query-time components. Index-time components are focused on extracting, processing, enriching and indexing the data, while query-time components are concerned with interpreting the query, retrieving and ranking search results.
In the case of Lucidworks Fusion, the routine process of data ingestion and indexing of prepared data is completely handled by the platform. The Fusion platform fully manages the Solr and Spark clusters and provides the framework and interfaces for the ingestion and indexing pipeline. The core intelligence of the search solution is implemented via custom data processing components which implement domain-specific data analysis and other business logic. The Fusion platform provides the ability to plug in, chain and configure those custom components through configuration parameters and code interfaces.
In some rare cases, very specific and complex requirements lead to a number of low-level modifications to the Solr service and the Lucene search library. Those modifications are available through Solr extensions and patches, which can be plugged into Fusion as well. However, such customizations require additional integration and maintenance effort and generally should be avoided.
At the heart of any search system lies the query processing pipeline. It is responsible for
- Interpreting the natural language query.
- Enriching it with additional data and converting it into a low-level boolean retrieval representation.
- Producing ranking instructions which can be executed by the search engine.
Fusion provides a query pipeline framework which allows configuring and chaining query processing components, passing the query context and parameters between them. This allows us to build fairly sophisticated query processing pipelines without writing a lot of boilerplate request processing and context management code.
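To make the idea of chained stages with a shared context more concrete, here is a minimal plain-Python sketch. It does not use the Fusion API; the stage names, the context keys (`q`, `tokens`) and the `run_pipeline` helper are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# A stage takes the shared request context and returns it (possibly modified).
# This is an illustrative stand-in, not the Fusion query stage interface.
Stage = Callable[[Dict], Dict]


def normalize_stage(ctx: Dict) -> Dict:
    # Lowercase the raw query and collapse repeated whitespace.
    ctx["q"] = " ".join(ctx["q"].lower().split())
    return ctx


def tokenize_stage(ctx: Dict) -> Dict:
    # Store tokens in the context so that later stages can reuse them.
    ctx["tokens"] = ctx["q"].split(" ")
    return ctx


def run_pipeline(stages: List[Stage], ctx: Dict) -> Dict:
    # Execute the stages in order, threading the same context through each.
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_pipeline([normalize_stage, tokenize_stage],
                      {"q": "  M Olive  Shirt Dress "})
print(result)
```

Each stage only knows about the context object, which is what lets a platform reorder, configure or reuse stages without touching their code.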
Finally, the Fusion external API helps integrate front-end applications with the query processing pipeline.
Semantic query parsing with Fusion
Now, it’s time to map the semantic query parsing blueprint to the components and capabilities of the Fusion platform.
In this blog post, we will not focus on general product catalog indexing and full Search API features, such as faceting, filtering, business rules, autocomplete, etc… Fusion provides components available out of the box to cover many of those concerns. Let’s assume that the product index has been prepared, and consider specifics of semantic query parsing and multi-stage search.
Semantic search implementation contains three main steps:
- Semantic data ingestion and indexing, i.e. preparation of the concept index and additional semantic information based on the search configuration. It will be based on extra collections, index pipelines and Spark jobs.
- Concept-oriented query parsing, i.e. representation of the initial search phrase as a semantic query graph. This step will be a part of a query pipeline.
- Concept-oriented search, which includes transformation of the semantic query graph to a search engine query and multi-stage search. It will also be part of the query pipeline.
Semantic data pipelines
Semantic data, in our context, contains all domain-specific information and configuration used in the query understanding workflow, and consists of several independent indexes:
- Concept index contains implicit or explicit concepts extracted from the indexed products and the knowledge base.
- Auxiliary semantic indexes contain additional semantic information like linguistics, comprehension patterns and scoring configuration.
Auxiliary semantic indexes
The structure of the indexing pipelines for the auxiliary semantic indexes can be pretty simple. We need to perform the following steps for each index:
- Create a corresponding collection.
- Configure an appropriate connector for the datasource depending on your storage, e.g. file, cloud storage, database, another index, etc.
- Implement a parser for your data format, or configure an existing one, e.g. plain text, JSON, CSV, XML, etc.
- Implement an index pipeline with a “Solr Indexer” stage, preceded by data preprocessing stages if necessary.
Concept index
The concept indexing flow is more complicated than the one we used for the auxiliary semantic collections. We can’t build it using only standard index pipelines because it is based on both the product index and a knowledge base, and it needs additional processing inside the pipeline.
It is convenient to organize this process as a Spark job. We can implement our job using Java, Scala or Python and deploy it to the Fusion platform. The job will retrieve documents from the product index and the knowledge base, extract attributes from them, perform preprocessing or filtering of the extracted values, if necessary, and pass them to the concept collection through the remaining index pipeline steps.
The index pipeline helps to separate data ingestion logic from the concept extraction and concept indexing parts of the code, keeping your code neat and to the point.
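As a rough illustration of the extraction step, the sketch below turns product documents into concept documents in plain Python. It is not a Spark job and does not reflect the actual concept schema: the attribute names in `CONCEPT_FIELDS`, the concept document layout and the `extract_concepts` helper are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical attribute names; a real catalog defines its own.
CONCEPT_FIELDS = ("color", "size", "product_type")


def extract_concepts(products: Iterable[Dict]) -> List[Dict]:
    # Count how many products carry each normalized (field, value) pair.
    counts = defaultdict(int)
    for product in products:
        for field in CONCEPT_FIELDS:
            value = product.get(field)
            if value:
                counts[(field, str(value).strip().lower())] += 1
    # Emit one concept document per distinct pair, in a stable order.
    return [
        {"id": f"{field}:{value}", "field": field,
         "value": value, "product_count": n}
        for (field, value), n in sorted(counts.items())
    ]


products = [
    {"id": "1", "color": "Olive", "size": "M", "product_type": "Shirt Dress"},
    {"id": "2", "color": "olive", "size": "L", "product_type": "Shirt Dress"},
]
concepts = extract_concepts(products)
for concept in concepts:
    print(concept)
```

In a Spark job the same per-document logic would typically run as a map step followed by an aggregation, with the resulting documents sent to the concept collection.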
Concept reindexing can be configured to launch right after the product indexing job completes, or once per week if the catalog does not change frequently. Alternatively, it can be part of a full reindexing process.
The logic responsible for transforming raw product attributes into concept documents can be implemented using a custom JavaScript stage or by deploying a Java-based plug-in built with the Index Stage SDK. The latter is the preferable option, as it provides more convenient configuration than JavaScript, which requires embedding all parameters directly in the script code.
For example, let’s take a look at a cURL request [3] which creates a small index pipeline with a custom JavaScript stage:
curl -u admin:password1 -X POST -H 'Content-type: application/json' -d '{
"id" : "eventsim",
"stages" : [ {
"type" : "date-parsing",
"sourceFields" : [ "ts", "registration" ],
"dateFormats" : [ ],
"defaultTimezone" : "UTC",
"defaultLocale" : "en"
}, {
"type" : "javascript-index",
"script" : "function(doc) {\n var states = {\"AL\": \"Alabama\",\"AK\": \"Alaska\",\"AS\": \"American Samoa\", /* ... and all other state codes in this manner */};\nvar loc = \"\";\nvar idx = 0;\nvar abbr = \"\";\nvar state = \"\";\nif (doc.hasField(\"location\")) {\n  loc = doc.getFirstFieldValue(\"location\");\n  idx = loc.lastIndexOf(\",\");\n  if (idx > 0) {\n    abbr = loc.substr(idx + 2, 2);\n    state = states[abbr];\n    if (state) {\n      doc.addField(\"state\", state);\n    }\n  }\n}\nreturn doc;\n}",
"label" : "Get State"
}, {
"type" : "solr-index",
"enforceSchema" : true,
"dateFormats" : [ ],
"params" : [ ]
}]}' http://fusion-host:6764/api/apps/analytics/index-pipelines
and compare it with the same pipeline where the custom indexing stage is implemented via the Index Stage SDK:
curl -u admin:password1 -X POST -H 'Content-type: application/json' -d '{
"id" : "eventsim",
"stages" : [ {
"type" : "date-parsing",
"sourceFields" : [ "ts", "registration" ],
"dateFormats" : [ ],
"defaultTimezone" : "UTC",
"defaultLocale" : "en"
}, {
"type" : "get-state-index-stage",
"pluginStageType": "get-state-index-stage",
"states": {
"AL": "Alabama",
"AK": "Alaska",
"AS": "American Samoa",
"AZ": "Arizona",
"AR": "Arkansas",
"CA": "California",
//... and all other state codes in this manner
},
"label" : "Get State"
}, {
"type" : "solr-index",
"enforceSchema" : true,
"dateFormats" : [ ],
"params" : [ ]
}]}' http://fusion-host:6764/api/apps/analytics/index-pipelines
In the Fusion UI, this stage will look like:
Concept-oriented query parsing
The next step is the construction of a query pipeline, which represents the concept-oriented query parsing workflow. This pipeline will be responsible for the following operations:
- Initial transformation of the search phrase to the semantic query graph representation.
- Graph enrichment with spelling corrections, linguistics, comprehensions, compounds and other available semantic information.
- Concept tagging, where we identify the fields in which particular parts of the query can be found.
- Path scoring, which ranks alternative query interpretations represented as graph paths.
From the Fusion platform point of view, the query understanding workflow is just a regular query pipeline, where each step is a separate query stage. As in an index pipeline, a stage in a query pipeline can be implemented using a JavaScript stage or using Java and the Query Stage SDK (again, the preferable option). From a query stage, we have access to internal Lucidworks Fusion APIs, which allow us to communicate with existing collections and retrieve the data required for the particular query understanding step.
We perform the same query analysis as for the product index and build the initial graph. After that, the graph passes as a context object through the stages of the query pipeline, where it is enriched, transformed and cleaned. The final query graph represents all our hypotheses about possible interpretations of each piece of the initial search phrase, and for each hypothesis we can estimate how probable it is compared to the others.
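The sketch below shows one possible shape of concept tagging and path scoring on a tiny in-memory graph. It is a simplification under stated assumptions: the `CONCEPTS` dictionary stands in for the concept index, the weights are invented, and scoring a path as the product of its edge weights is just one simple choice, not necessarily the scoring used in the actual solution.

```python
from typing import Dict, List, NamedTuple, Tuple


class Edge(NamedTuple):
    start: int     # index of the first token covered
    end: int       # index one past the last token covered
    field: str     # field in which the phrase was found
    value: str     # matched phrase
    weight: float  # confidence of this hypothesis (made-up numbers)


# Hypothetical concept index: phrase -> list of (field, weight).
CONCEPTS: Dict[str, List[Tuple[str, float]]] = {
    "m": [("size", 0.9)],
    "olive": [("color", 0.9)],
    "shirt": [("product_type", 0.6)],
    "dress": [("product_type", 0.7)],
    "shirt dress": [("product_type", 0.95)],
}


def tag_concepts(tokens: List[str]) -> List[Edge]:
    # Try every contiguous token span and look it up in the concept index.
    edges = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            phrase = " ".join(tokens[start:end])
            for field, weight in CONCEPTS.get(phrase, []):
                edges.append(Edge(start, end, field, phrase, weight))
    return edges


def score_paths(edges: List[Edge], n_tokens: int) -> List[Tuple[float, List[Edge]]]:
    # Enumerate all edge sequences that cover tokens 0..n_tokens without gaps.
    paths = []

    def walk(pos: int, path: List[Edge], score: float) -> None:
        if pos == n_tokens:
            paths.append((score, path))
            return
        for edge in edges:
            if edge.start == pos:
                walk(edge.end, path + [edge], score * edge.weight)

    walk(0, [], 1.0)
    # Best interpretation first.
    return sorted(paths, key=lambda p: p[0], reverse=True)


tokens = "m olive shirt dress".split()
ranked = score_paths(tag_concepts(tokens), len(tokens))
best_score, best_path = ranked[0]
print([(e.field, e.value) for e in best_path])
```

With these weights, the interpretation that treats “shirt dress” as a single product type outranks the one that splits it into “shirt” and “dress”.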
Concept-oriented search
One of the common techniques to achieve a good balance between precision and recall in search results is a multi-stage search workflow. In this workflow, we are gradually relaxing search query interpretations until we are able to find good results.
Each search stage dictates its own configuration of fields participating in the stage and types of matches and normalizations allowed. We perform query understanding according to the configuration and build a semantic query graph. This graph is later used to build the actual Solr query that Fusion executes against the main index.
To implement this efficiently, we need to run the query understanding and query building pipelines several times with different configurations.
Luckily, Fusion allows us to achieve this separation and reuse with the “call pipeline” query stage:
The “call pipeline” stage allows us to encapsulate and reuse the query understanding and query building parts of the pipeline in every search stage, as well as to implement interactions between stages inside the pipeline.
Query building
Now, let’s look into the process of converting our query interpretation to a Solr query. Each graph conversion step is implemented as a separate Fusion query pipeline stage, in the same way as for query understanding. These can be completely separate stages (Java-based plug-ins or scripts, if you prefer JavaScript) or stages based on the same code, with configuration specifying which parts of the semantic query graph should be used for the particular stage.
Let’s consider a simple search phrase like “m olive shirt dress” and its possible graph representation in order to better understand the query building process.
For each search stage we will use only allowed graph edges to build a search query:
- Exact match – full search phrase is covered by product attributes like size, color and product type.
- Incomplete match – full search phrase is covered by product attributes, but we suggest the more general color “green”, which is a hypernym of the initially stricter “olive”.
- Partial match – we assume that the user may have made a typo and meant “short” instead of “shirt”, so we can try to find a “short dress”. We can also try to omit some original terms and search just for “shirt” or “dress” instead of “shirt dress”.
- Text match – here we use all the hypotheses we have, i.e. we are trying to match individual terms in product descriptions using hypernym and spelling correction.
As you can see, every subsequent stage allows for more relaxed matches, increasing recall at the expense of precision.
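The relaxation from stage to stage can be sketched as a filter over the hypotheses in the graph. The example below is illustrative only: the field names, the match type labels and the per-stage configuration in `STAGES` are assumptions, and the output is a simple query string in Lucene/Solr syntax rather than the query the real stages would build.

```python
from typing import Dict, List, Set, Tuple

# Each hypothesis is (field, value, match_type). All names are hypothetical.
Hypothesis = Tuple[str, str, str]

# One list of alternative hypotheses per part of "m olive shirt dress".
GRAPH: List[List[Hypothesis]] = [
    [("size", "m", "exact")],
    [("color", "olive", "exact"), ("color", "green", "hypernym")],
    [("product_type", "shirt dress", "exact"),
     ("product_type", "short dress", "spelling")],
]

# Match types allowed per search stage, from strict to relaxed.
STAGES: Dict[str, Set[str]] = {
    "exact": {"exact"},
    "incomplete": {"exact", "hypernym"},
    "partial": {"exact", "hypernym", "spelling"},
}


def build_query(graph: List[List[Hypothesis]], allowed: Set[str]) -> str:
    clauses = []
    for alternatives in graph:
        terms = [f'{field}:"{value}"'
                 for field, value, kind in alternatives if kind in allowed]
        if not terms:
            continue  # the stage allows no interpretation of this part
        clauses.append(terms[0] if len(terms) == 1
                       else "(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


for name, allowed in STAGES.items():
    print(name, "->", build_query(GRAPH, allowed))
```

The same function serves every stage; only the set of allowed match types changes, which mirrors the idea of reusing one stage implementation with different configuration.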
Staged search
Let’s consider a simple scenario, where we want to implement multi-stage search with exit criteria by the number of products found, i.e. if we have found enough items, we can return these results, otherwise we should switch to a more relaxed query interpretation and try again.
Each such search stage represents an interconnected pair of query pipeline stages:
- A custom query stage which encapsulates the Solr query building logic.
- A standard “Solr Query” stage which sends the request to the search engine.
Fusion allows setting a condition on each pipeline stage which determines whether the stage is run or skipped. We can use this mechanism to control the search flow. For each request to the search engine, the response contains the number of items matched. So, after each block of query understanding, query building and query execution, we can check the condition and either return the result or proceed to the next stage.
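The control flow of such a staged search can be sketched in a few lines of Python. In Fusion this logic would be expressed through stage conditions rather than an explicit loop; the `staged_search` function, the `min_results` threshold and the fake search backend below are assumptions used only to show the exit criterion.

```python
from typing import Callable, Dict, List, Tuple


def staged_search(stage_queries: List[Tuple[str, str]],
                  execute: Callable[[str], Dict],
                  min_results: int) -> Dict:
    # Try stages from strict to relaxed and stop at the first one
    # that returns enough results; otherwise keep the last response.
    response = {"stage": None, "numFound": 0, "docs": []}
    for stage_name, query in stage_queries:
        result = execute(query)
        response = {"stage": stage_name, **result}
        if result["numFound"] >= min_results:
            break
    return response


# Fake search backend standing in for the "Solr Query" stage.
FAKE_INDEX = {"q_exact": 0, "q_incomplete": 3, "q_partial": 40}


def fake_execute(query: str) -> Dict:
    n = FAKE_INDEX.get(query, 0)
    return {"numFound": n, "docs": [f"doc-{i}" for i in range(n)]}


stages = [("exact", "q_exact"),
          ("incomplete", "q_incomplete"),
          ("partial", "q_partial")]
result = staged_search(stages, fake_execute, min_results=3)
print(result["stage"], result["numFound"])
```

If no stage reaches the threshold, this sketch returns the results of the most relaxed stage, which is one reasonable fallback among several.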
Adding a UI to look at the results
Even if the search application is a backend service for a frontend, it is always a good idea to have a simple UI to look at the search results. Fusion provides an easy way to create such a UI. Lucidworks App Studio is a collection of software and resources that helps to quickly build feature-rich search apps [3]. Out of the box, it provides a lot of components to visualize search features such as faceting, filtering, sorting, typeahead, highlights, etc. We can easily integrate it with our Search API. The only thing we need to do is create a query profile associated with the main query pipeline and configure App Studio to use this query profile.
At this point we have the search flow completely implemented within the Fusion platform, from the connector to the datasource with the product catalog to the front-end application which interacts with customers. No third-party service was necessary apart from the cloud or on-premises resources needed to deploy the platform itself; everything was implemented using Fusion platform capabilities only.
Comparing approaches
Once we have all pipelines configured, it is interesting to compare the semantic query parsing technique with the default Fusion search pipeline [3] and see whether it brings improvements.
For this comparison we will consider a small catalog containing about 14,000 passenger tires from different brands and with various specifications. For the visualization we created an application based on Fusion App Studio and connected it to our query pipeline via a query profile, as described in [3].
Let’s assume that a customer is looking for a set of tires with section width 155 and rim diameter R13, and all other parameters don’t matter to him. So what will he see in the result set with the regular Fusion search implementation?
First of all, the result set contains tires with rim diameter 14 and section widths 175, 165, 145, etc., as the size facet clearly shows. Further, we notice a tire with section width 175 at the top of the result set. And finally, only 2 of the 6 tires with exact size “155 R13” are present on the first page. Recall of this approach is quite good, as all relevant tires are found, but precision can be significantly improved.
Let’s look at the concept-oriented search results for the same query:
Now the result set looks much better. We retrieved tires with section width 155 only. Moreover, all 6 tires with exact size “155 R13” are placed on the top of the result set, i.e. these products got a higher score than others. Let’s look at the semantic graph for this query.
This query representation lets us understand the results we got. Semantic query parsing detected that token “155” means section width, token “r13” means rim diameter. But it also found that the combination of these tokens is the exact size value in our catalog and boosted this hypothesis in the final Solr query. So that’s why we got higher scores for the top 6 tires in our result set.
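One way to picture the boosted hypothesis is as a required clause on the individual attributes plus an optional, boosted clause on the exact size. The sketch below builds such a query string in standard Lucene/Solr syntax. The field names (`section_width`, `rim_diameter`, `size`), the boost value and the `build_boosted_query` helper are assumptions; the actual query produced by the pipeline is not shown in this post.

```python
from typing import List, Tuple


def build_boosted_query(required: List[Tuple[str, str]],
                        boosted: List[Tuple[str, str]],
                        boost: int) -> str:
    # Required clauses restrict the result set; boosted clauses are
    # optional and only raise the score of documents that match them.
    must = " AND ".join(f"{field}:{value}" for field, value in required)
    should = " ".join(f'{field}:"{value}"^{boost}' for field, value in boosted)
    return f"+({must}) {should}".strip()


query = build_boosted_query(
    required=[("section_width", "155"), ("rim_diameter", "r13")],
    boosted=[("size", "155 r13")],
    boost=10,
)
print(query)
```

With a query of this shape, every returned tire has section width 155, and tires whose size is exactly “155 R13” receive an additional score contribution.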
Conclusion
In this blog post, we described a practical approach to implementing the semantic query parsing technique on the Fusion platform. We showed how the Fusion platform allows you to focus on the core search algorithms while taking care of a lot of boilerplate concerns such as data integration and query pipelining. By combining semantic query parsing with the Fusion platform, we can bring relevant results to customers faster and with higher quality.
Happy searching!
References
- [1] Lucidworks Fusion Architecture
- [2] Semantic query parsing blueprint
- [3] Guy Sperry (2020), Fusion in Action, Version 5, Manning Publications