How to replatform Endeca rules to Solr
Apr 07, 2020 • 6 min read
In the previous article we discussed the Endeca rules model and explained how to re-implement this model using Elasticsearch. We needed to implement inverted search to trigger our rules, and we leveraged the powerful percolator feature in Elasticsearch, which greatly simplified our implementation. In this blog post, we will discuss how to approach the implementation of Endeca rules if you are running Solr.
Unfortunately, Solr currently does not have percolator-like functionality. We believe it will be available soon because Lucene 8.2 support is already merged. Meanwhile, we can employ an alternative approach and implement inverted search based purely on Solr queries. We will use the same example we used in the previous article for illustration.
Quick refresher on Endeca triggers
Firstly, let’s recall the particular trigger types that we will have to implement:
Match phrase: the search phrase contains search terms sequentially in a strict order but may also contain other words before or after.
Example: The rule is configured with search terms = “how to”. The search phrase “how to make an order” will trigger this rule. At the same time, the search phrase “how can I get to the store” will not trigger the rule.
Match all: the search phrase contains all search terms in any order with optional additional words in any position.
Example: The rule is configured with search terms = “oven best pizza”. The search phrase “what is the best oven for cooking pizza” will trigger this rule.
Match exact: the rule will be triggered if and only if the search phrase is exactly equal to the search terms. No additional words are allowed.
Example: The rule is configured with search terms “order status”. Only the search phrase “order status” will trigger this rule, not any other.
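The three trigger types above can be modeled in a few lines of plain Python. This is only a reference sketch of the semantics, not Endeca or Solr code: the function names are hypothetical, and tokenization is assumed to be a simple lowercase split on whitespace.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplification of real text analysis)."""
    return text.lower().split()


def match_phrase(search_terms, search_phrase):
    """True if the terms occur contiguously and in order inside the phrase."""
    terms, words = tokenize(search_terms), tokenize(search_phrase)
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


def match_all(search_terms, search_phrase):
    """True if every term occurs somewhere in the phrase, in any order."""
    return set(tokenize(search_terms)) <= set(tokenize(search_phrase))


def match_exact(search_terms, search_phrase):
    """True only if the phrase consists of exactly the terms, in the same order."""
    return tokenize(search_terms) == tokenize(search_phrase)
```

With these definitions, match_phrase("how to", "how to make an order") holds, while match_phrase("how to", "how can I get to the store") does not, which mirrors the examples above.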
We will use those triggers as an example, and we will use the default Solr configuration for simplicity. So, let’s roll up our sleeves and get some inverted search up and running!
First, after launching Solr, we need to create a new core/collection named rules. We can do it from the core admin page or from the terminal by executing the bin/solr create -c rules command.
Solr rules model
We are going to use the same logical rule structure as in the previous post. The rule will be modeled as a parent document, with triggers represented as child documents. So how will the example from the previous post look in the case of Solr?
In this post, we will extend our simple rule engine functionality with two essential features: rule collapsing and sorting. Rule collapsing refers to the situation when multiple rules of the same type fire, and we have to select the one with the highest priority (represented as the lowest priority number).
Let’s start with the basic rule structure. Note that we used the *_i, *_s and *_t suffixes in order to map to integer, string and text field types respectively.
{
  "id": "1",
  "priority_i": 1, // 1
  "action_t": "<some serialized action>", // 2
  "actionType_s": "<REDIRECT/FACET/BOOST/BURY....>",
  "scope_s": "rule",
  "_childDocuments_": [
    {
      "id": "1",
      "keyword_s": "<phrase to be triggered on>",
      "keyword_t": "<phrase to be triggered on>", // 3
      "keyword_words_count_i": <Integer value. Count of words in keyword field>, // 3
      "matchmode_s": "<MATCHEXACT/MATCHPHRASE/MATCHALL>",
      "scope_s": "trigger"
    },
    { ..... }
  ]
}
- (1) Priority, which is used to sort rules of the same type during collapsing. The highest priority is represented by the lowest value. When multiple rules of the same type fire, the rule with the lowest priority value is selected.
- (2) This is the action payload. In the case of REDIRECT it is a URL; for other, more complicated rules it can be JSON with a serialized rule object.
- (3) Technical derivatives of the keyword field. These are usually created on the Solr side using custom Update Request Processors, but in this example (for simplicity) we produce the derivatives on the client side.
  - keyword_s – non-tokenized keyword representation. Used for MATCHEXACT and MATCHPHRASE triggers.
  - keyword_t – tokenized keyword representation. Used for MATCHALL trigger matching.
  - keyword_words_count_i – number of words in the keyword. Used for MATCHALL trigger matching.
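Producing the keyword derivatives on the client side could look like the following minimal sketch. The helper names build_trigger and build_rule are hypothetical, and the word count assumes that a whitespace split yields the same tokens as the Solr text field analysis and that a keyword does not repeat a word.

```python
def build_trigger(trigger_id, keyword, match_mode):
    """Build a trigger (child) document with its derived keyword fields."""
    words = keyword.split()
    normalized = " ".join(words)  # collapse repeated whitespace
    return {
        "id": trigger_id,
        "keyword_s": normalized,              # non-tokenized form
        "keyword_t": normalized,              # tokenized by Solr at index time
        "keyword_words_count_i": len(words),  # expected number of matches
        "matchmode_s": match_mode,
        "scope_s": "trigger",
    }


def build_rule(rule_id, priority, action, action_type, triggers):
    """Build a rule (parent) document with its triggers as child documents."""
    return {
        "id": rule_id,
        "priority_i": priority,
        "action_t": action,
        "actionType_s": action_type,
        "scope_s": "rule",
        "_childDocuments_": list(triggers),
    }
```

For example, build_trigger("tr3", "oven best pizza", "MATCHALL") yields a child document whose keyword_words_count_i is 3.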
Now, let’s convert our sample rules into the Solr input structure:
{
  "id": "1",
  "priority_i": 2,
  "action_t": "http://retailername.com/FAQ",
  "actionType_s": "REDIRECT",
  "scope_s": "rule",
  "_childDocuments_": [
    {
      "id": "tr1",
      "keyword_s": "how to",
      "matchmode_s": "MATCHPHRASE",
      "scope_s": "trigger",
      "keyword_t": "how to",
      "keyword_words_count_i": 2
    }
  ]
},
{
  "id": "2",
  "priority_i": 1,
  "action_t": "http://retailername.com/orders",
  "actionType_s": "REDIRECT",
  "scope_s": "rule",
  "_childDocuments_": [
    {
      "id": "tr2",
      "keyword_s": "order status",
      "matchmode_s": "MATCHEXACT",
      "scope_s": "trigger",
      "keyword_t": "order status",
      "keyword_words_count_i": 2
    }
  ]
},
{
  "id": "3",
  "priority_i": 1,
  "action_t": "http://retailername.com/top10ovens",
  "actionType_s": "REDIRECT",
  "scope_s": "rule",
  "_childDocuments_": [
    {
      "id": "tr3",
      "keyword_s": "oven best pizza",
      "matchmode_s": "MATCHALL",
      "scope_s": "trigger",
      "keyword_t": "oven best pizza",
      "keyword_words_count_i": 3
    }
  ]
}
After the indexing, we will have our core filled with our sample rules.
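One way to send the rules to Solr is a plain HTTP POST to the update handler. The sketch below assumes the rules core created earlier on localhost:8983, the standard /update handler with an immediate commit, and rule documents wrapped in a JSON array; the function names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: the "rules" core on a local Solr instance.
SOLR_UPDATE_URL = "http://localhost:8983/solr/rules/update?commit=true"


def build_update_request(rules, url=SOLR_UPDATE_URL):
    """Wrap the rule documents in a JSON array and prepare a POST request."""
    body = json.dumps(list(rules)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_rules(rules, url=SOLR_UPDATE_URL):
    """Send the documents to Solr and return the HTTP status (needs a running Solr)."""
    with urllib.request.urlopen(build_update_request(rules, url)) as response:
        return response.status
```

Keeping the request construction separate from the network call makes the payload easy to inspect before anything is sent.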
Matching rules using Solr queries
Since we have 3 different match modes, in order to build our inverted search query, we need to create a disjunction boolean query. We will show you the final result and then walk you through every part of the query.
Let’s use the keyword “how to cook” as an example. Below is the complete request that matches rules for the “how to cook” user keyword.
http://localhost:8983/solr/rules/select?exactQuery=keyword_s:"how to cook"&fq={!collapse field=actionType_s sort='priority_i asc'}&matchAllQuery={!frange l=0 u=0 incl=true incu=true v='sub(sum(max(0, query({!lucene v="keyword_t:how^=1"})),max(0, query({!lucene v="keyword_t:to^=1"})),max(0, query({!lucene v="keyword_t:cook^=1"}))),field(keyword_words_count_i))'}&phraseQuery=keyword_s:"how to" OR keyword_s:"to cook"&q={!parent which=scope_s:rule v=$triggerQuery}&triggerQuery=+(({!lucene v=$exactQuery} AND filter(matchmode_s:MATCHEXACT)) OR ({!lucene v=$phraseQuery} AND filter(matchmode_s:MATCHPHRASE)) OR ({!lucene v=$matchAllQuery} AND filter(matchmode_s:MATCHALL))) AND filter(scope_s:trigger)
As you can see, this request correctly returns rule no. 1, associated with a matchPhrase trigger configured on “how to”.
So, let’s analyze all parts of this complex query:
http://localhost:8983/solr/rules/select?
This is a request to the regular select RequestHandler.
q={!parent which=scope_s:rule v=$triggerQuery}&
ToParentBlockJoinQuery is needed to match a rule (parent document) by its matched triggers (child documents).
triggerQuery=+(({!lucene v=$exactQuery} AND filter(matchmode_s:MATCHEXACT)) OR ({!lucene v=$phraseQuery} AND filter(matchmode_s:MATCHPHRASE)) OR ({!lucene v=$matchAllQuery} AND filter(matchmode_s:MATCHALL))) AND filter(scope_s:trigger)&
This is the main query for matching triggers. As you can see, it is a disjunction query with 3 clauses for the 3 different match modes. The specific queries for each mode are extracted into the separate nested params exactQuery, phraseQuery and matchAllQuery.
exactQuery=keyword_s:"how to cook"&
MatchExact query. It is very straightforward: we just need to check that the keyword field content is exactly the same as the user’s query. As we are only looking for an exact match, the un-tokenized string field is used.
phraseQuery=keyword_s:"how to" OR keyword_s:"to cook"&
MatchPhrase query. Here the query parser needs to cut all possible n-grams from the user search phrase. As we have a very short example keyword, we have only two word bigrams, “how to” and “to cook”. Using this approach, we match only those triggers whose keyword is equal to some subphrase of the user keyword.
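Cutting the n-grams and assembling the phraseQuery parameter can be sketched as follows. The function names are hypothetical. Note one assumption that goes beyond the example request: by default the sketch emits every n-gram length from a single word up to the full phrase, so that one-word triggers and triggers equal to the whole phrase can also match. Passing min_n=2 and max_n=2 reproduces the two clauses shown above.

```python
def escape_phrase(text):
    """Escape backslashes and double quotes so the text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def word_ngrams(phrase, min_n=1, max_n=None):
    """Return all contiguous word n-grams of the phrase, shortest first."""
    words = phrase.split()
    if max_n is None:
        max_n = len(words)
    return [
        " ".join(words[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def build_phrase_query(phrase, field="keyword_s", min_n=1, max_n=None):
    """Build an OR query that matches triggers whose keyword equals any n-gram."""
    grams = word_ngrams(phrase, min_n, max_n)
    return " OR ".join(f'{field}:"{escape_phrase(gram)}"' for gram in grams)
```

The number of clauses grows quadratically with the number of words in the search phrase, so it may be worth capping max_n at the length of the longest configured trigger.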
matchAllQuery={!frange l=0 u=0 incl=true incu=true v='sub(sum(max(0, query({!lucene v="keyword_t:how^=1"})),max(0, query({!lucene v="keyword_t:to^=1"})),max(0, query({!lucene v="keyword_t:cook^=1"}))),field(keyword_words_count_i))'}&
The matchAll query is the trickiest one, because it leads to the inverted search problem. We will discuss it separately to properly explain all the details.
fq={!collapse field=actionType_s sort='priority_i asc'}
This is the collapse filter query, which keeps only the top rule of each action type, that is, the one with the lowest priority value.
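The parameters walked through above can be assembled and URL-encoded programmatically, which avoids hand-encoding mistakes. The following is a minimal sketch under the assumption that the three sub-queries are already built as strings; build_select_url is a hypothetical helper, and the base URL is the local rules core used throughout this post.

```python
from urllib.parse import urlencode

# Main trigger query, copied from the request above.
TRIGGER_QUERY = (
    "+(({!lucene v=$exactQuery} AND filter(matchmode_s:MATCHEXACT))"
    " OR ({!lucene v=$phraseQuery} AND filter(matchmode_s:MATCHPHRASE))"
    " OR ({!lucene v=$matchAllQuery} AND filter(matchmode_s:MATCHALL)))"
    " AND filter(scope_s:trigger)"
)


def build_select_url(exact_query, phrase_query, match_all_query,
                     base_url="http://localhost:8983/solr/rules/select"):
    """Combine the sub-queries into a fully URL-encoded select request."""
    params = {
        "q": "{!parent which=scope_s:rule v=$triggerQuery}",
        "triggerQuery": TRIGGER_QUERY,
        "exactQuery": exact_query,
        "phraseQuery": phrase_query,
        "matchAllQuery": match_all_query,
        "fq": "{!collapse field=actionType_s sort='priority_i asc'}",
    }
    return base_url + "?" + urlencode(params)
```

urlencode percent-encodes the leading + of triggerQuery as %2B, so it reaches Solr as a plus sign instead of being decoded as a space.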
Matchall query
Formally speaking, a matchAll query has to find the rules where the tokens configured in the trigger are a subset of the tokens from the user query. We don’t know in advance which tokens will match, but we know that the number of matched tokens should be exactly the same as the total number of tokens in the trigger.
We conveniently store the number of tokens in the keyword in the field keyword_words_count_i.
We will use the Solr function query framework to perform this precise matching. Function queries were designed for match scoring, but with some simple tricks we can use them for precise filtering as well:
{!frange l=0 u=0 incl=true incu=true v=' // 5
  sub( // 4
    sum( // 2
      max(0, query({!lucene v="keyword_t:how^=1"})), // 1
      max(0, query({!lucene v="keyword_t:to^=1"})),
      max(0, query({!lucene v="keyword_t:cook^=1"}))
    ),
    field(keyword_words_count_i) // 3
  )
'}
We will unwind this query from the inside out, so follow the numbers in the listing:
- (1) This clause returns a score of 1 if a term is present in the trigger and 0 otherwise. The ^=1 constant-score boost ensures that every match counts as exactly 1.
- (2) We sum all scores, getting the total number of matches.
- (3) We retrieve the value of the field keyword_words_count_i, which contains the expected number of matches.
- (4) We subtract the expected number of matches from the actual number of matches. If the result is zero, the rule should be retrieved.
- (5) We use frange to create a range query over the internal function value. We set both the lower and the upper bound of the range to 0 to select only zero scores.
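The steps above translate directly into a small query builder. This is a sketch rather than a definitive implementation: build_match_all_query is a hypothetical name, the tokens are assumed to be plain words with no query-syntax characters, and repeated words are dropped so that a word typed twice is not counted as two matches.

```python
def build_match_all_query(phrase, text_field="keyword_t",
                          count_field="keyword_words_count_i"):
    """Build the frange filter that keeps triggers fully covered by the phrase."""
    # Deduplicate while keeping order (an assumption, see the lead-in).
    tokens = list(dict.fromkeys(phrase.lower().split()))
    # One constant-score clause per token: contributes 1 on a match, 0 otherwise.
    clauses = ",".join(
        f'max(0, query({{!lucene v="{text_field}:{token}^=1"}}))'
        for token in tokens
    )
    # matches - expected == 0 means every trigger word was found in the phrase.
    return (
        "{!frange l=0 u=0 incl=true incu=true "
        f"v='sub(sum({clauses}),field({count_field}))'}}"
    )
```

As a worked example, for the search phrase “best oven for pizza” the trigger “oven best pizza” collects 1 + 1 + 0 + 1 = 3 matches against an expected count of 3, so the difference is 0 and the trigger is selected. The trigger “how to” collects 0 matches against an expected count of 2, giving -2, which falls outside the range.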
That’s it. Now we are able to perform inverted search and match our matchAll triggers.
Let’s consider some more examples:
The request for “order status” keyword, which correctly matches rule no. 2 associated with matchExact trigger configured on phrase “order status” goes as follows:
http://localhost:8983/solr/rules/select?exactQuery=keyword_s:%22order%20status%22&fq={!collapse%20field=actionType_s%20sort=%27priority_i%20asc%27}&matchAllQuery={!frange%20l=0%20u=0%20incl=true%20incu=true%20v=%27sub(sum(max(0,%20query({!lucene%20v=%22keyword_t:order^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:status^=1%22}))),field(keyword_words_count_i))%27}&phraseQuery=keyword_s:%22order%20status%22&q={!parent%20which=scope_s:rule%20v=$triggerQuery}&triggerQuery=+(({!lucene%20v=$exactQuery}%20AND%20filter(matchmode_s:MATCHEXACT))%20OR%20({!lucene%20v=$phraseQuery}%20AND%20filter(matchmode_s:MATCHPHRASE))%20OR%20({!lucene%20v=$matchAllQuery}%20AND%20filter(matchmode_s:MATCHALL)))%20AND%20filter(scope_s:trigger)
The request for the “best oven for pizza” keyword, which correctly matches rule no. 3 associated with the matchAll trigger configured on the word set “oven best pizza”, goes as follows:
http://localhost:8983/solr/rules/select?exactQuery=keyword_s:%22best%20oven%20for%20pizza%22&matchAllQuery={!frange%20l=0%20u=0%20incl=true%20incu=true%20v=%27sub(sum(max(0,%20query({!lucene%20v=%22keyword_t:best^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:oven^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:for^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:pizza^=1%22}))),field(keyword_words_count_i))%27}&phraseQuery=keyword_s:%22best%20oven%22%20OR%20keyword_s:%22oven%20for%22%20OR%20keyword_s:%22for%20pizza%22%20OR%20keyword_s:%22best%20oven%20for%22%20OR%20keyword_s:%22oven%20for%20pizza%22&q={!parent%20which=scope_s:rule%20v=$triggerQuery}&triggerQuery=+(({!lucene%20v=$exactQuery}%20AND%20filter(matchmode_s:MATCHEXACT))%20OR%20({!lucene%20v=$phraseQuery}%20AND%20filter(matchmode_s:MATCHPHRASE))%20OR%20({!lucene%20v=$matchAllQuery}%20AND%20filter(matchmode_s:MATCHALL)))%20AND%20filter(scope_s:trigger)&fq={!collapse%20field=actionType_s%20sort=%27priority_i%20asc%27}
We can also consider the keyword “how to oven best pizza”, which matches both the “how to” matchPhrase trigger and the “oven best pizza” matchAll trigger. Because of the collapsing filter query (fq), we get only rule no. 3, which has the highest priority (the lowest priority_i value).
http://localhost:8983/solr/rules/select?exactQuery=keyword_s:%22how%20to%20oven%20best%20pizza%22&matchAllQuery={!frange%20l=0%20u=0%20incl=true%20incu=true%20v=%27sub(sum(max(0,%20query({!lucene%20v=%22keyword_t:how^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:to^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:oven^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:best^=1%22})),max(0,%20query({!lucene%20v=%22keyword_t:pizza^=1%22}))),field(keyword_words_count_i))%27}&phraseQuery=keyword_s:%22how%20to%22%20OR%20keyword_s:%22to%20oven%22%20OR%20keyword_s:%22oven%20best%22%20OR%20keyword_s:%22best%20pizza%22%20OR%20keyword_s:%22how%20to%20oven%22%20OR%20keyword_s:%22to%20oven%20best%22%20OR%20keyword_s:%22oven%20best%20pizza%22%20OR%20keyword_s:%22how%20to%20oven%20best%22%20OR%20keyword_s:%22to%20oven%20best%20pizza%22&q={!parent%20which=scope_s:rule%20v=$triggerQuery}&triggerQuery=+(({!lucene%20v=$exactQuery}%20AND%20filter(matchmode_s:MATCHEXACT))%20OR%20({!lucene%20v=$phraseQuery}%20AND%20filter(matchmode_s:MATCHPHRASE))%20OR%20({!lucene%20v=$matchAllQuery}%20AND%20filter(matchmode_s:MATCHALL)))%20AND%20filter(scope_s:trigger)&fq={!collapse%20field=actionType_s%20sort=%27priority_i%20asc%27}
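The effect of collapsing in this last example can be modeled in plain Python. This only illustrates the selection logic described in this post, not how Solr implements the collapse filter. collapse_rules is a hypothetical helper, and on equal priorities it keeps the first rule seen, which is an arbitrary choice made for this sketch.

```python
def collapse_rules(matched_rules):
    """Keep one rule per action type: the one with the lowest priority_i value."""
    best = {}
    for rule in matched_rules:
        action_type = rule["actionType_s"]
        current = best.get(action_type)
        if current is None or rule["priority_i"] < current["priority_i"]:
            best[action_type] = rule
    return list(best.values())


# Rules no. 1 and no. 3 both fire for "how to oven best pizza".
matched = [
    {"id": "1", "priority_i": 2, "actionType_s": "REDIRECT"},
    {"id": "3", "priority_i": 1, "actionType_s": "REDIRECT"},
]
surviving_ids = [rule["id"] for rule in collapse_rules(matched)]
```

Both rules are of type REDIRECT, so only one survives, and rule no. 3 wins because its priority_i value of 1 is lower than the 2 of rule no. 1.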
Conclusion
In this blog post, we discussed the trickiest part of Endeca rule migration, the matchAll trigger implementation. A full-fledged implementation should also include other aspects, such as:
- Matching rules by selected filters. We split the queries into two cases of trigger matching, exactLocation true and exactLocation false. The exactLocation true case is trivial to implement, while the exactLocation false case is very similar to the inverted search approach used in matchAll trigger matching.
- Matching default rules.
- Applying normalizations, such as spelling correction to ensure that if “jeans” product is retrieved by a misspelled “jeanz” phrase, the rule configured for “jeans” will fire as well.
- Splitting triggers by Browse and Search navigation types.
Happy searching!