Monday, August 18, 2014

how to organize a multi-type elasticsearch query

Suppose your search involves querying a parent type as well as one or more of its child types.
Well, it's pretty obvious that you should use a Bool Query to combine the queries of the different document types.
However, suppose each type involves both a query and a filter. How would you then combine all the queries and filters together?
My first attempt at this was to use a single Filtered Query for the entire query, and each type would contribute to the overall FilteredQuery/query and FilteredQuery/filter.
I later found out that this model sometimes produced incorrect search results, was complicated, and was perhaps not so efficient.
The more logical way to do this is for each type to produce its own Filtered Query, with the top Bool Query merely joining those filtered queries together. This way, each type has its own independent clause, which results in a simple and clean overall query structure. And the search results are always correct, too.
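
To make the structure concrete, here is a minimal sketch in Python that builds such a request body as a plain dict. It assumes the child types are reached through has_child clauses and that the clauses are joined under bool/must; the type name ("job"), the field names, and the helper functions are all hypothetical and only there for illustration.

```python
import json


def filtered(query, filter_):
    """Wrap one type's query and filter in its own Filtered Query clause."""
    return {"filtered": {"query": query, "filter": filter_}}


def build_search(parent_clause, child_clauses):
    """Join independent per-type clauses under a single top-level Bool Query.

    child_clauses maps a child type name to that type's filtered clause.
    Using has_child and bool/must here is an assumption of this sketch.
    """
    must = [parent_clause]
    for child_type, clause in child_clauses.items():
        must.append({"has_child": {"type": child_type, "query": clause}})
    return {"query": {"bool": {"must": must}}}


# Hypothetical parent type with one hypothetical child type, "job".
body = build_search(
    filtered({"match": {"name": "smith"}}, {"term": {"status": "active"}}),
    {
        "job": filtered(
            {"match": {"title": "engineer"}},
            {"range": {"year": {"gte": 2010}}},
        )
    },
)

print(json.dumps(body, indent=2))
```

Each type's query and filter stay together inside their own clause, so adding or removing a type only adds or removes one entry in the must list.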

The writer is R&D team leader at Niloosoft Hunter HRMS

Thursday, August 14, 2014

Clarifying elasticsearch TopChildren, "factor" & "estimated hits size"

I found the TopChildren documentation less than totally clear, so here is my clarification.

The "estimated hits size" (also referred to in the documentation as "hits expected") refers to the number of child document hits. That is to say, it is how many child documents will be looked for in the query on the child docs.

The set of child documents found this way is then aggregated into parents.

If you ask for 10 parents (query size=10), elasticsearch will use the default factor value of 5 and search for 50 child documents (the "hits expected" mentioned above). The documents found will then be aggregated into parent documents.
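
The arithmetic can be sketched in a few lines of Python. This is only an illustration of the numbers above, not elasticsearch code: the helper names are hypothetical, the from_ offset is an assumption based on my reading of the documentation, and the child type, field, and incremental_factor value in the example are made up.

```python
def estimated_hits(size, factor=5, from_=0):
    """Number of child documents the query on the child docs will look for.

    size is the number of parents asked for; factor defaults to 5 as described
    above. Including the from_ offset is an assumption of this sketch.
    """
    return (from_ + size) * factor


def top_children_query(child_type, child_query, factor=5, incremental_factor=2):
    """Build a top_children clause as a plain dict (illustrative values only)."""
    return {
        "top_children": {
            "type": child_type,
            "query": child_query,
            "score": "max",
            "factor": factor,
            "incremental_factor": incremental_factor,
        }
    }


print(estimated_hits(10))            # 10 parents * factor 5 = 50 child docs
print(estimated_hits(10, factor=8))  # a larger factor means more child docs
clause = top_children_query("job", {"match": {"title": "engineer"}})
```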

If several child docs belong to the same parent, the aggregation may result in fewer parents than asked for. In this case, if there are additional child documents to query, elasticsearch will expand the query to include more child docs, using the incremental_factor parameter.
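
Here is a toy Python model of why the aggregation can come up short. It is not how elasticsearch is implemented internally; the hit layout (an "id" and a "parent" key per child hit) and the function name are assumptions made up for this illustration.

```python
def aggregate_into_parents(child_hits):
    """Group child hits by their parent id, preserving first-seen order."""
    parents = {}
    for hit in child_hits:
        parents.setdefault(hit["parent"], []).append(hit["id"])
    return parents


requested_parents = 10

# 50 child hits (size 10 * factor 5) that happen to belong to only 4 parents.
child_hits = [{"id": "c%d" % i, "parent": "p%d" % (i % 4)} for i in range(50)]

parents = aggregate_into_parents(child_hits)

# Fewer distinct parents than requested: this is the situation in which the
# search on the child docs would have to be expanded.
needs_expansion = len(parents) < requested_parents

print(len(parents), needs_expansion)
```

In this made-up data set the 50 child hits collapse into just 4 parents, so a request for 10 parents cannot be satisfied from the first batch of child docs.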

The total_hits in the response will not be accurate if the "estimated hits size" is less than the number of child documents that actually match the query. The larger the "estimated hits size" (controlled by the factor parameter), the larger the potential total_hits. But this of course hurts performance.

One more point to be aware of: the number of parent documents discussed here is the number of docs returned by the TopChildren query itself. This number may be further reduced by adjacent or higher-level queries/filters.
If this short explanation clarified things for you, please leave a comment and let me know :)
