When real-time search isn't

Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.

From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.

Many search engine architectures assume that indexes won’t need to be updated very often and trade index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically merged into the static set. Every search must then run across all of these indexes, with care taken to keep term statistics accurate across them so that relevancy ranking stays consistent.
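To make the multi-index idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular engine does it: the names `Segment`, `TieredIndex` and `merge_threshold` are hypothetical, document IDs are assumed to be unique, the "merge" simply freezes the fresh index into the static set, and scoring is a plain TF-IDF. The point it tries to show is that document counts and document frequencies are summed over every index before scoring, so a term's weight does not depend on which index a document happens to live in.

```python
import math
from collections import Counter, defaultdict


class Segment:
    """A tiny in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1


class TieredIndex:
    """One small fresh segment plus a set of older, static segments."""

    def __init__(self, merge_threshold=1000):
        self.fresh = Segment()
        self.static = []  # older segments, never modified after freezing
        self.merge_threshold = merge_threshold  # hypothetical tuning knob

    def add(self, doc_id, text):
        # New content only ever touches the small fresh segment.
        self.fresh.add(doc_id, text)
        if self.fresh.doc_count >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Simplification: freeze the fresh segment into the static set.
        # A real engine would likely also compact static segments together.
        self.static.append(self.fresh)
        self.fresh = Segment()

    def search(self, query):
        segments = self.static + [self.fresh]
        # Global statistics: summed over all segments, not per segment.
        total_docs = sum(s.doc_count for s in segments)
        scores = defaultdict(float)
        for term in query.lower().split():
            df = sum(len(s.postings.get(term, {})) for s in segments)
            if df == 0:
                continue
            idf = math.log(1 + total_docs / df)
            for s in segments:
                for doc_id, tf in s.postings.get(term, {}).items():
                    scores[doc_id] += tf * idf
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With a small `merge_threshold`, adding a few documents and then searching returns hits from both the frozen and the fresh segments in a single ranked list. A smaller threshold keeps the fresh segment cheap to update, at the cost of more segments to consult on every query.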

The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results – does ‘more recent’ always trump ‘more relevant’? As always, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.
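One way to picture that default is a single blended score with a user-adjustable weight. The sketch below is only an illustration: the function names, the exponential half-life decay and the assumption that relevance is already normalised to the range 0 to 1 are all choices made for this example, not something prescribed above.

```python
def blended_score(relevance, age_seconds, recency_weight=0.5,
                  half_life_seconds=3600.0):
    """Mix a relevance score (assumed to be in [0, 1]) with a freshness score.

    recency_weight=0.0 ranks purely by relevance,
    recency_weight=1.0 ranks purely by recency.
    """
    # Freshness halves every half_life_seconds: 1.0 for brand-new content.
    freshness = 0.5 ** (age_seconds / half_life_seconds)
    return (1.0 - recency_weight) * relevance + recency_weight * freshness


def rank(hits, recency_weight=0.5):
    """hits: iterable of (doc_id, relevance, age_seconds) tuples."""
    scored = sorted(
        hits,
        key=lambda hit: blended_score(hit[1], hit[2], recency_weight),
        reverse=True,
    )
    return [doc_id for doc_id, _, _ in scored]
```

Exposing `recency_weight` to the user gives the "one or the other" option: setting it to 0 sorts by relevance alone, setting it to 1 sorts by recency alone, and anything in between is a compromise.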

In any case, there will always be some delay between content being published and becoming searchable – the trick is to keep this delay to a minimum, so that search appears as ‘real-time’ as possible.

2 thoughts on “When real-time search isn't”

    • It depends what you mean by ‘constantly changing’! We have various customers working with daily news content for example, where stories might change during the day, and in this case we might store a persistent ID for a particular news story so we can replace the indexed terms later. One advantage we have is that updates to the index are immediately searchable.
