Posts Tagged ‘SOLR’
Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.
Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practice one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.
Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. Some interesting points here including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.
Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.
Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.
Another excellent evening as part of the Enterprise Search London Meetup series; very busy as usual.
Amir Dotan started us off with details of his work in designing user interfaces for the financial services sector, describing some of the challenges involved in designing for a high-pressure and highly regulated environment. Although he didn’t talk about search specifically we heard a lot about how to design useful interfaces. Two quotes stood out: “The right user interface can help make billions”, and as a way to get feedback “find someone nice in the business and never let them go”.
Gregory Grefenstette of Exalead was next, talking about his new book on Search Based Applications. He explained how SBAs have advantages over traditional databases in the three areas of agility, usability and performance and went on to show some examples, before an unfortunate combination of a broken slide deck and a failing laptop battery brought him to a halt: in retrospect a great advertisement for a physical book over a computer!
Upayavira of Sourcesense was next with details of a new search built for online news aggregator Moreover. This dealt with scaling Lucene/Solr to cope with indexing 2 million new documents a day, for a rolling 2 month index. He showed how some initial memory and performance problems had been solved with a combination of pre-warming caches, tweaks to the JVM and Java garbage collector and eventually profiling of their custom code. Particularly interesting was how they had developed a system for spinning up a complete copy of the searchable database (for load balancing purposes) on the Amazon EC2 cloud – from a standing start they can allocate servers, install software and copy across searchable indexes in around 40 minutes. This was a great demonstration of the power of the open source model – no more licenses to buy! Search performance over this large collection is pretty good as well, with faceted queries returning in a second or two and unfaceted in half a second.
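To put the scale in context, here is a rough back-of-the-envelope sketch of what a rolling index like this implies. The 2 million documents a day figure is from the talk; treating "2 month" as 60 days, the `published` field name and the idea of expiring old documents with a Solr delete-by-query are my own assumptions for illustration, not a description of how the Moreover system actually works.

```python
# Figures from the talk, plus one labelled assumption.
DOCS_PER_DAY = 2_000_000   # stated in the talk
WINDOW_DAYS = 60           # assumption: "2 month" taken as roughly 60 days


def steady_state_docs(docs_per_day, window_days):
    """Approximate number of documents held once the rolling window is full."""
    return docs_per_day * window_days


def expiry_delete_xml(date_field, window_days):
    """Build a Solr XML delete-by-query message for documents older than the window.

    `date_field` is a hypothetical date field name; NOW-<n>DAYS is Solr date math.
    """
    return f"<delete><query>{date_field}:[* TO NOW-{window_days}DAYS]</query></delete>"


print(steady_state_docs(DOCS_PER_DAY, WINDOW_DAYS))
print(expiry_delete_xml("published", WINDOW_DAYS))
```

Under these assumptions the index settles at around 120 million documents, which gives some feel for why cache warming and garbage collector tuning mattered.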
We also heard from Martin White about an exciting new search related conference to be held in October this year in London in association with Information Today, Inc., and I managed a quick plug for our inaugural Cambridge Enterprise Search Meetup on Wednesday 16th February.
Analysts Ovum have released a report on enterprise search – it’s not clear where to obtain it yet, although Report Linker may have it available. According to one report it may also be called “Enterprise Search and Retrieval: Exploiting all of the Organisation’s Information Assets”.
Interestingly, most of the press coverage around the release is focussing on statements by the author, Mike Davis, about open source solutions – in particular “…in fact, companies should only go to the big proprietary players if open source can’t deliver what they need.” He also states that “there are mere nuances between those ranked” – and this includes the open source option Solr 1.4.
This is the clearest statement yet from an analyst that enterprise search engines are all pretty much the same thing, if you strip away the marketing – but more importantly, that open source should be the first option to consider.
I spent yesterday afternoon at the International Society for Knowledge Organisation’s Legal KnowHow event, a series of talks on legal knowledge and how it is managed. The audience was a mixture of lawyers, legal information managers, vendors and academics, and the talks came from those who are planning legal knowledge systems or implementing them. I also particularly enjoyed hearing from Adam Wyner from Liverpool University who is modelling legal arguments in software, using open source text analysis. You can see some of the key points I picked up on our Twitter feed.
What became clear to me during the afternoon is that search technology is not currently serving the needs of lawyers or law firms. The users want a simple Google-like interface (or think they do), the software is having trouble presenting results in context and the source data is large, complex and unwieldy. The software used for search is from some of the biggest commercial search vendors (legal firms seem to ‘follow the pack’ in terms of what vendor they select – unfortunately few of the large law firms seem to have even considered the credible open source alternatives such as Lucene/Solr or Xapian).
In many cases taxonomies were presented as the solution – make sure every document fits tidily into a hierarchy and all the search problems go away, as lawyers can simply navigate to what they need. All very simple in theory – however each big law firm and each big legal information publisher has their own idea of what this taxonomy should be.
After the final presentation I argued that this seemed to be a classic case where an open source model could help. If a firm or publisher were prepared to create an open source legal taxonomy (and to be fair, we’re only talking about 5000 entries or so – this wouldn’t be a very big structure) and let this be developed and improved collaboratively, they would themselves benefit from others’ experience, the transfer of legal data between repositories would be easier and even the search vendors might learn a little about how lawyers actually want to search. The original creators would be seen as thought-leaders and could even license the taxonomy so it could not be rebadged and passed off as original by another firm or publisher.
However my plea fell on stony ground: law firms seem to think that their own taxonomies have inherent value (and thus should never be let outside the company) and they regard the open source model with suspicion. Perhaps legal search will remain broken for the time being.
If you’re considering a Lucene/Solr powered search solution, you may be interested in LucidWorks Enterprise, produced by our partners Lucid Imagination. They’ve taken Lucene/Solr and added a powerful admin GUI, REST API, web spiders, file crawlers, database connectors, alerts, a clickthrough framework and more. All this comes with a range of excellent support options backed by the experts at Lucid.
If you’d like to know more read this downloadable PDF or contact us for more information and a demo.
I spent yesterday at Search Solutions 2010, hosted by the British Computer Society. They’d been kind enough to ask me to speak (Update: my slides are available here, the rest are available at the event website above), but there were plenty of other people to listen to as well. There’s a great blow-by-blow account from Tyler Tate already, but here are some personal highlights:
Google’s Behshad Behzadi spoke about freshness for web content and how Google’s usual ranking strategy favours older results over new ones – as the new ones don’t have so many links. Vishwa Vinay from Microsoft talked on what to do with click data in enterprise search – he listed lots of papers on the subject, hopefully his slides will be published so we can follow them up. He made the point that any ‘adaptive’ ranking based on click data must still work well out of the box, before any clicks have happened. This section of the event finished with Vivian Lin Dufour of Yahoo!, talking about some ways of guiding searchers from within the UI, with auto-suggest and similar techniques. Apparently the research the Yahoo team are doing on trending has let them spot news stories 12-24 hours before they hit the papers. I wondered afterwards, is this current fad for ‘trendspotting’ turning search engines into just a media channel? I don’t care much about the X-Factor TV show myself, so why should this current trend influence my search results?
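The point about adaptive ranking needing to work before any clicks exist can be illustrated with a small sketch. This is not the method from the talk or from any of the papers mentioned; the boost formula, the `weight` and `prior_impressions` parameters and the data shapes are all assumptions of mine. The idea is simply that a smoothed click-through rate contributes nothing when there is no click data, so the engine's original order is preserved out of the box.

```python
def adjusted_score(base_score, clicks, impressions, weight=0.5, prior_impressions=20):
    """Boost a base relevance score by a smoothed click-through rate.

    With no clicks recorded the boost is zero, so the base score is returned unchanged.
    `weight` and `prior_impressions` are illustrative tuning values, not recommendations.
    """
    smoothed_ctr = clicks / (impressions + prior_impressions)
    return base_score * (1.0 + weight * smoothed_ctr)


def rerank(results, click_log):
    """Re-order (doc_id, base_score) pairs using click data where it exists.

    `click_log` maps doc_id -> (clicks, impressions); missing documents count as (0, 0).
    """
    scored = [
        (doc_id, adjusted_score(score, *click_log.get(doc_id, (0, 0))))
        for doc_id, score in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


results = [("a", 2.0), ("b", 1.9), ("c", 1.0)]
print(rerank(results, {}))                  # no click data: base order kept
print(rerank(results, {"b": (80, 100)}))    # heavily clicked "b" moves up
```

The smoothing term also stops a single early click from swinging the ranking, which is one simple way to keep the cold-start behaviour sensible.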
Nick Patience started the next session talking about trends in the Enterprise Search market: he acknowledged the rapid rise of open source solutions and talked about how search-based applications will become increasingly important, with a huge market for ‘information governance’ solutions opening up. Chirag Ghandhi of Mphasis, a search integrator, talked about how customers are disillusioned with enterprise search, and how difficult it is to build solutions that cope with data from a range of different sources and in different languages. Dusan Rnic of Endeca stressed the importance of being able to handle the ‘long tail’ of search results – the ones that aren’t the most popular – and showed us his favourite website – strangely enough, an Endeca customer.
Greg Lindahl talked about how Blekko have built an innovative web crawling/indexing framework, which has enabled them to build up a 3 billion page index very efficiently – looking forward to seeing more of this. As he said, what they’re doing isn’t necessarily better than Google, but it’s certainly different. My talk on open source search for news content followed, and then Roberto Cornacchia showed us Spinque’s approach to building search platforms – encapsulating search expert knowledge into logical ‘blocks’ that can be combined by domain experts into the solutions they actually need.
The last session began with Till Kinstler of GBV Common Library Network, a self-described ‘library hacker’, on building a search system using the open source engine Solr over 25 million library records – they’re now aiming for 120 million, taken from 400 different libraries, in source formats going all the way back to tape and paper library cards! We then heard about the Information Retrieval Facility, an open IR research institution – I liked their three principles of ‘open science, open source, open market’. The talks finished with Rob Stacey on True Knowledge’s ways of checking the veracity of facts gathered from the internet.
We then moved on to an open panel – some great themes here including the rise of search as a platform for new applications, what exciting (or scary) things Facebook might bring to the world of search, and how we should all work harder to bring good information retrieval mechanisms to those who cannot currently access them due to poverty, language barriers or disability.
Thanks to the BCS IRSG and in particular to Udo Kruschwitz for a very interesting and enlightening day.
A fascinating event last night as the Guardian team told us more about how they’ve used open source search technology to build their new open platform. The presentations were brief and to-the-point, and covered how the team have created a detailed, rich API to their news content, all built on the open source engine Apache Solr – opening up Guardian Media Group content to the world for mashups, repurposing and innovative new business models.
The Guardian have an existing Oracle database with J2EE web applications to serve content, but discovered that certain operations such as returning content with multiple tags, or dynamically generated ‘related’ content, were very database-intensive and difficult to scale. The use of Solr effectively flattens the cost of these complex queries, and also allows them to scale up capacity on demand by simply spinning up more Solr instances on the Amazon EC2 cloud. Interestingly, site search for the Guardian website doesn’t yet use Solr, although they hope to move this across soon.
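To make the multi-tag case concrete, here is a minimal sketch of how such a request might be expressed against Solr. This is not the Guardian's code: the host, the `/solr/select` path and the `tag` field name are hypothetical. The parameters themselves (`q`, `fq`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr query parameters; each `fq` is an independent filter, so asking for content carrying several tags at once is just a matter of adding more filters rather than more database joins.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_tag_query(base_url, tags, rows=10):
    """Build a Solr select URL for documents carrying every tag in `tags`.

    Each tag becomes its own fq (filter query). The `tag` field is a hypothetical
    name, and tags containing double quotes are not escaped in this sketch.
    """
    params = [("q", "*:*"), ("rows", str(rows)), ("wt", "json")]
    for tag in tags:
        params.append(("fq", f'tag:"{tag}"'))
    # Ask Solr to return counts for other tags on the matching documents too.
    params += [("facet", "true"), ("facet.field", "tag")]
    return base_url + "?" + urlencode(params)


url = build_tag_query("http://localhost:8983/solr/select", ["politics", "environment"])
print(url)
```

The facet counts returned alongside the results are also one plausible starting point for the ‘related’ content mentioned above, since they show which other tags co-occur with the ones requested.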
What we’re seeing here is a change in how search technology is used especially by forward-looking organisations – from being a bolt-on to an existing website or application, search is now the platform for new developments. I’ll be talking about other ways open source search has been used for news content at the British Computer Society this coming Thursday 21st October – I believe there are still a few places available.
Back for the second day of Lucene Revolution, with some great talks on migrating to Solr from FAST ESP, the new flexible indexing features coming to Lucene ‘real soon now’, and finishing off with a panel discussion. I felt privileged to sit as part of this panel between Eric Gries, CEO of Lucid Imagination, and Paul Doscher of Exalead – the discussion was lively and (I hope!) interesting for the audience.
I’m looking forward to returning to the UK with all I’ve learnt from this event, and to following up on some of the ideas generated – for example, it would be great to be able to demonstrate LucidWorks Enterprise to interested parties in London.
Thanks to Stephen Arnold’s team and all at Lucid Imagination for organising such a great conference. It won’t be the last I’m sure!
I’m at the Lucene Revolution conference in Boston, USA for the next few days – and it’s aptly named. If there’s anyone out there who still doubts that open source search is a serious alternative to a commercial engine, the numbers and other information coming out of this event will be convincing. Twitter are now using Lucene to handle a billion queries a day; LinkedIn and SalesForce.com are already veterans with similarly huge installations. The conversations I’m having and overhearing are about billions of documents, tens of thousands of users, all easily handled by open source search.
The other big news here is that Lucid Imagination have released software to fill in most if not all of the gaps between Lucene/Solr and the closed-source competition – it’s called LucidWorks Enterprise and adds a detailed administration UI, a REST API, crawlers, scaling functionality and much more. I’m looking forward to getting my hands on a demo and showing it off when back in the UK.
There’s an optimistic, buzzing energy at this event – a real feeling that we’re here at the beginning of something big. More revolutionary news to come!
We’re very happy to announce that we’ve been selected as an Authorized Partner by Lucid Imagination, the commercial company for Lucene and Solr. You can read the press release as a PDF here.
Apache Lucene and Solr, available as open source software from the Apache Software Foundation, are powerful, scalable, reliable and fully-featured search technologies. Solr is the Lucene Search Server, making it easy to build search applications for the enterprise.
With our long experience of customising, installing and supporting open source search engines, this partnership is a natural fit for us, and we’re excited by the opportunities it presents. In addition to our current offerings, Flax will now offer installation, integration and commercial support packages for Lucene and Solr, backed by Lucid Imagination.