Posts Tagged ‘real time’

Search backwards – media monitoring with open source search

We’re working with a number of clients on media monitoring solutions, which are a special case of search application (we’ve worked on this previously for Durrants). In standard search, you apply a single query to a large number of documents, expecting to get back a ranked list of the documents that match your query. However, in media monitoring you need to search each incoming document (for example, a news article or blog post) with many queries representing what the end user wants to monitor – and you need to do this quickly, as you may have tens or hundreds of thousands of articles to monitor in close to real time (Durrants have over 60,000 client queries to apply to half a million articles a day). This ‘backwards’ search isn’t really what search engines were designed to do, so performance could potentially be very poor.

There are several ways around this problem: for example, in most cases you don’t need to monitor every article for every client, as they will have told you they’re only interested in certain sources (a car manufacturer might want to keep an eye on car magazines and the reviews on the back page of the Guardian Saturday magazine, but doesn’t care about the rest of the paper or about fashion magazines). However, pre-filtering queries in this way can be complex, especially when there are so many potential sources of data.
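To make the pre-filtering idea concrete, here is a minimal Python sketch. It is an illustration only, not code from any client system: the `QueryRouter` class, the query ids and the source names are all hypothetical, and it assumes each client query is registered either against a fixed list of sources or against every source.

```python
from collections import defaultdict


class QueryRouter:
    """Maps each source to the stored queries that care about it (hypothetical helper)."""

    def __init__(self):
        self._by_source = defaultdict(set)  # source name -> query ids
        self._any_source = set()            # query ids that watch every source

    def register(self, query_id, sources=None):
        """Register a query; sources=None means 'monitor everything'."""
        if sources is None:
            self._any_source.add(query_id)
        else:
            for source in sources:
                self._by_source[source].add(query_id)

    def candidates(self, source):
        """Return only the query ids worth running for an article from this source."""
        return self._any_source | self._by_source.get(source, set())


router = QueryRouter()
router.register("car-maker", ["Car Magazine", "Guardian Saturday magazine"])
router.register("broad-brand")  # no source restriction
print(router.candidates("Car Magazine"))
print(router.candidates("Fashion Weekly"))
```

The lookup itself is cheap; the complexity mentioned above comes from keeping the source lists accurate when there are very many sources, which this sketch does not attempt to solve.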

We’ve recently managed to develop a method for searching incoming articles using a brute-force approach based on Apache Lucene, which in early tests is performing very well – around 70,000 queries applied to a single article in around a second on a standard MacBook. On suitable server hardware this would be even faster – and of course you have all the other features of Lucene potentially available, such as phrase queries, wildcards and highlighting. We’re looking forward to being able to develop some powerful – and economically scalable – media monitoring solutions based on this core.
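The shape of a brute-force ‘backwards’ search can be sketched in a few lines of plain Python. This is not the Lucene-based method described above, just an illustration of the idea: index the single incoming article in memory, then run every stored query against that tiny index. The function names and the query format (each stored query is a list of phrases that must all appear) are assumptions made for the example.

```python
import re
from collections import defaultdict


def index_article(text):
    """Build a tiny in-memory positional index (term -> positions) for one article."""
    positions = defaultdict(list)
    for pos, token in enumerate(re.findall(r"\w+", text.lower())):
        positions[token].append(pos)
    return positions


def matches_phrase(index, phrase):
    """True if the words of `phrase` occur consecutively in the indexed article."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return False
    return any(
        all(start + offset in index.get(word, ()) for offset, word in enumerate(words))
        for start in index.get(words[0], ())
    )


def monitor(text, stored_queries):
    """Apply every stored query to one article and return the ids that match."""
    index = index_article(text)  # built once, reused for all queries
    return [
        query_id
        for query_id, phrases in stored_queries.items()
        if all(matches_phrase(index, phrase) for phrase in phrases)
    ]


queries = {
    "acme-cars": ["acme", "electric car"],  # hypothetical client queries
    "fashion": ["catwalk"],
}
print(monitor("Acme unveils a new electric car in Geneva.", queries))
```

The point of the sketch is the loop structure: the article is analysed once, and the cost then grows with the number of stored queries rather than with the size of a document collection. A real system would also need wildcards, highlighting and proper text analysis, which is where a library such as Lucene comes in.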

Enterprise Search Meetup: exploratory search, TravelMatch and Stephen Arnold

Last night I went to another excellent Enterprise Search London Meetup, at Skinkers near London Bridge. I’d been at the Online show all day, which was rather tiring, so it was great to sit down with beer and nibbles and hear some excellent speakers.

Max Wilson kicked off with a talk on exploratory search and ‘searching for leisure’. His Search Interface Inspector looks like a fascinating resource, and we heard about how he and his team have been constructing a taxonomy for the different kinds of search people do, using Twitter as a data source.

Martina Schell was next with details of Travel Match, a holiday search engine that’s trying to do for holidays what our customer Mydeco is doing for interior design: scrape/feed/gather as much holiday data as you can, put it all into a powerful search engine and build innovative interfaces on top. They’ve tried various interfaces including a ‘visual search’, but after much user testing have reined back their ambitions somewhat – however they’re still unique in allowing some very complex queries of their data. Interestingly, one challenge they identified is how to inform users that one choice (say, airport to fly from) may affect the available range of other choices (say, destinations) – apparently users often click repeatedly on ‘greyed-out’ options, unsure as to why they’re not working…

The inimitable Stephen Arnold concluded the evening with a realistic treatment of the current fashion for ‘real-time’ search. His point was that unless you’re Google, with their fibre-connected, hardware-accelerated gigascale architecture, you’re not going to be able to do real-time web search or anything close to it; on a smaller scale, for financial trading, military and other serious applications, you again need to rely on the hardware – so for proper real-time (that means very close to zero latency), your engineering capability, not your software capability, is what counts. I’m inclined to agree – I trained as an electronic engineer and worked on digital audio, back when this was also only possible with clever hardware design. Of course, eventually commodity hardware gets fast enough to move away from specialised devices, and at that point even the laziest coder can create responsive systems, but we’re far away from that point. Perhaps the marketing departments of some search companies should take note – if you say you can do real-time indexing, we’re not going to believe you.

Thanks again to Tyler Tate and all at TwigKit for continuing to organise and support this excellent event.

Predictions

A new year, and a chance to think about what might happen in the world of enterprise search over the next twelve months. I’ll make a stab at some predictions:

  1. Price cuts – possibly driven by even harsher competition between Google and Microsoft FAST, I can see prices coming down for packaged enterprise search. Autonomy will probably raise theirs :-)
  2. Real time search matures – not just Twitter or Facebook, but real time data from many sources being part of enterprise search results
  3. More geolocation-aware search – in the U.K. at least, we’re seeing signs that the source data is finally being freed up, which should make it a lot simpler and cheaper to build location-aware solutions
  4. Fewer second-tier players in the market – it’s still difficult out there, and I’m afraid not every company will survive the next year.

You’re welcome to take any of these with a generous pinch of salt!

Posted in Business

January 20th, 2010

Online Information 2009, day 1

I visited the Online Information exhibition yesterday at Olympia. My first impression was that the exhibition area was very quiet – and a few of the exhibitors agreed with me. The current financial situation would seem the obvious cause. At previous shows exhibitors have given away all kinds of freebies, from bags, to mini mice, to branded juggling balls… but this year you’d be lucky if you came away with a couple of free pens and a boiled sweet.

I dropped in on the associated conference later, and caught a presentation titled “The Real Time Web: Discovery vs. Search”. Antonio Gulli of Microsoft told us about their new European offices, including one in Soho, that are concentrating on bringing new features to Bing – but the results look very familiar. Is Bing doomed to play catch-up? The only ‘real time’ feature he discussed was indexing Twitter, although apparently they’ll soon be indexing Facebook as well. Surely real time encompasses more than these two platforms?

Stephen Arnold gave us his thoughts on what we should mean by ‘real time’, sensibly talking about how the financial services world has been using real time systems for many years. He also injected some notes of caution about how difficult it is to trust information spread amongst peers on social networking sites – here’s a recent case (read further down the page for a great quote from Graham Cluley).

Someone from Endeca (I didn’t catch the name; he was replacing the published speaker) showed us lots of slides of various applications of search, but his theme seemed to be less about ‘real time’ than about how search can replace traditional databases, something I’ve blogged about recently.

We finished with Conrad Wolfram, demonstrating Wolfram Alpha, which isn’t really a search engine but rather a computation engine – it tries to give you a set of answers, rather than a list of possible resources where the answer might be found. Not a lot of ‘real time’ here either.

I’m back on Thursday as part of the closing keynote panel.

Posted in Uncategorized, events

December 2nd, 2009

When real-time search isn’t

Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.

From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.

A lot of search engine architectures are built on the assumption that indexes won’t need to be updated very often, sacrificing index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically being merged into the older static set. Searches must be made across all these indexes of course, with care taken to maintain accurate statistics and thus relevancy ranking.
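As a rough illustration of the fresh-plus-static arrangement, here is a small Python sketch. It is a toy, not how any particular engine implements it: the `Segment` and `TieredIndex` classes, the `merge_threshold` parameter and the simple tf-idf scoring are all assumptions made for the example, and it ignores updates and deletes. What it does show is the step that is easy to get wrong: document counts and document frequencies are summed across both indexes before scoring, so a hit in the small fresh index is ranked on the same scale as a hit in the static one.

```python
import math
from collections import Counter


class Segment:
    """One index: per-document term counts plus document frequencies."""

    def __init__(self):
        self.docs = {}             # doc_id -> Counter of term frequencies
        self.doc_freq = Counter()  # term -> number of docs containing it

    def add(self, doc_id, text):
        terms = Counter(text.lower().split())
        self.docs[doc_id] = terms
        self.doc_freq.update(terms.keys())


class TieredIndex:
    """A small fresh segment that is periodically merged into a static one."""

    def __init__(self, merge_threshold=1000):
        self.fresh = Segment()
        self.static = Segment()
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        self.fresh.add(doc_id, text)  # new content only ever touches the fresh segment
        if len(self.fresh.docs) >= self.merge_threshold:
            self.merge()

    def merge(self):
        self.static.docs.update(self.fresh.docs)
        self.static.doc_freq.update(self.fresh.doc_freq)  # Counter.update adds counts
        self.fresh = Segment()

    def search(self, term):
        term = term.lower()
        segments = (self.fresh, self.static)
        # Global statistics: combine both segments before computing idf.
        total_docs = sum(len(s.docs) for s in segments)
        df = sum(s.doc_freq[term] for s in segments)
        if df == 0:
            return []
        idf = math.log(1 + total_docs / df)
        hits = [
            (doc_id, terms[term] * idf)
            for s in segments
            for doc_id, terms in s.docs.items()
            if term in terms
        ]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

For example, with `merge_threshold=2`, adding two documents triggers a merge into the static segment, a third document then sits alone in the fresh segment, and `search` still scores all three with the same idf value.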

The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results – does ‘more recent’ always trump ‘more relevant’? As always, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.
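One simple way to combine the two signals is a weighted blend, sketched below in Python. The formula is an assumption chosen for illustration rather than a recommendation: it presumes relevance scores already normalised to the range 0 to 1, and it models freshness as an exponential decay with a hypothetical `half_life_seconds` parameter. The `recency_weight` parameter is the user-facing option: 0 ranks purely by relevance, 1 purely by recency.

```python
def blended_score(relevance, age_seconds, recency_weight=0.5, half_life_seconds=3600.0):
    """Blend a normalised relevance score (0..1) with a freshness score (0..1)."""
    # Freshness halves every `half_life_seconds`: 1.0 when brand new, 0.5 after one half-life.
    freshness = 0.5 ** (age_seconds / half_life_seconds)
    return (1.0 - recency_weight) * relevance + recency_weight * freshness


def rank(results, now, recency_weight=0.5):
    """Sort result dicts (keys: 'relevance', 'published' as a timestamp) best first."""
    return sorted(
        results,
        key=lambda r: blended_score(r["relevance"], now - r["published"], recency_weight),
        reverse=True,
    )


results = [
    {"id": "old-but-relevant", "relevance": 0.9, "published": 0},
    {"id": "new-but-weak", "relevance": 0.3, "published": 7000},
]
print([r["id"] for r in rank(results, now=7200, recency_weight=0.0)])
print([r["id"] for r in rank(results, now=7200, recency_weight=1.0)])
```

With the weight at 0 the older, more relevant item comes first; with the weight at 1 the newer item does. The default of 0.5 sits between the two, which is the ‘combination of both’ described above.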

In any case, there will always be some delay between content being published and being searchable – the trick is to keep this to a minimum, so it appears as ‘real-time’ as possible.

Posted in News, Technical

November 5th, 2009
