By: Tom

Tom — Fri, 03 Dec 2010 14:14:02 +0000

“the search will run on a databases of 1/3rd size of original, so it will be faster.”

That’s mostly incorrect. Search time is proportional not to the size of the database but to the number of blocks read. In this case the number of block reads will be the same, so the search will take just as long. Think of it this way: reading the first 1000 bytes of a 10GB flat file is much faster than reading the whole file, because the cost depends on how much you read, not on how big the file is.
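One way to see why the block count barely moves is to assume that term lookups go through a B-tree-like index. That is an assumption for illustration only, not a description of any particular engine's internals, and the fanout of 100 and the entry counts below are made-up numbers. Under that assumption, the blocks read per lookup is roughly the height of the tree, which grows only logarithmically with the number of entries:

```python
def btree_block_reads(num_entries: int, fanout: int = 100) -> int:
    """Approximate blocks read for one key lookup: the height of a
    B-tree with the given fanout (hypothetical numbers, illustration only)."""
    height = 1
    capacity = fanout  # entries reachable with a tree of this height
    while capacity < num_entries:
        capacity *= fanout
        height += 1
    return height

full_db = btree_block_reads(30_000_000)   # the original database
one_shard = btree_block_reads(10_000_000)  # a shard of 1/3rd the size

print(full_db, one_shard)
```

With these numbers both lookups touch the same number of levels, so cutting the database to a third does not shorten the path to a term.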

Regarding the relevance issue: in Xapian, full statistics are exchanged by the remote protocol, so the ranking will be exactly the same as for a single database. This is not the case in SOLR, so you have to make sure that the shards are “balanced”, i.e. that they have similar statistics; otherwise the final ranking will be skewed.
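If the scores are computed from the same global statistics, they are directly comparable across servers, and combining the per-server top 10 lists is just a k-way merge. Here is a minimal sketch of that merge step, assuming each shard returns its hits already sorted by descending score. The function name, the `(score, doc_id)` tuples, and the sample data are all hypothetical; this is not how Xapian or SOLR implement it internally.

```python
import heapq
from itertools import islice

def merge_top_k(shard_results, k=10):
    """Merge per-shard hit lists into one global top-k list.

    shard_results: list of lists of (score, doc_id) tuples, each list
    sorted by descending score. Scores are assumed to be comparable
    across shards (i.e. computed from shared statistics).
    """
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, k))

# Made-up results from three shards.
shard_a = [(9.1, "a1"), (7.4, "a2"), (2.0, "a3")]
shard_b = [(8.3, "b1"), (3.3, "b2")]
shard_c = [(7.9, "c1"), (7.5, "c2")]

top = merge_top_k([shard_a, shard_b, shard_c], k=4)
print(top)
```

If the shards score against their own local statistics instead, the same merge still runs, but the scores it compares are not on a common scale, which is where the skew comes from.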

– Tom

By: James

James — Fri, 03 Dec 2010 14:02:43 +0000

I would disagree with “not provide any performance gain at all” for the Figure 2 model – the search will run on a databases of 1/3rd size of original, so it will be faster.

What I do not understand in both models (Figure 2 and Figure 3) is this: when I want, say, the first 10 results and I get 10 results from each server, how can I compare the relevance of the results returned by each server and sort them by relevance?

Comments on: Distributed search and partition functions
