Posts Tagged ‘lucene’

Free file filters, search & taxonomy tools from our old Googlecode repository

Google’s GoogleCode service is closing down, in case you hadn’t heard, and I’ve just started the process of moving everything over to our Github account. This prompted me to take a look at what’s there and there’s a surprising amount of open source code I’d forgotten about. So, here’s a quick rundown of the useful tools, examples and crazy ideas we’ve built over the years – perhaps you’ll find some of it useful – please do bear in mind however that we’re not officially supporting most of it!

  • Flax Basic is a simple enterprise search application built using Python and the Xapian search library. You can install this on your own Unix or Windows system to index Microsoft Office, PDF, RTF and HTML files and it provides a simple web application to search the contents of the files. Although the UI is very basic, it proved surprisingly popular among small companies who don’t have the budget for a ‘grown up’ search system.
  • Clade is a proof-of-concept classification system with a built-in taxonomy editor. Each node in the taxonomy is defined by a collection of words: as a document is ingested, if it contains these words then it is attached to the node. We’ve written about Clade previously. Again this is a basic tool but has proved popular and we hope one day to extend and improve it.
  • Flax Filters are a set of Python programs for extracting plain text from a number of common file formats – which is useful for indexing these files for search. The filters use a number of external programs (such as Open Office in ‘headless’ mode) to extract the data.
  • The Lucene Redis Codec is a (slightly crazy) experiment in how the Lucene search engine could store indexed data not on disk, but in a database – our intention was to see if frequently-updated data could be changed without Lucene noticing. Here’s what we wrote at the time.
  • There’s also a tool for removing fields from a Lucene index, a prototype web service interface with a JSON API for the Xapian search engine and an early version of a searchable database for historians, but to be honest these are all pre-alpha and didn’t get much further.

If you like any of these tools feel free to explore them further – but remember your hard hat and archeology tools!

Tags: , , ,

Posted in Technical

March 19th, 2015

No Comments »

Rebrands and changing times for Elasticsearch

I’ve always been careful to distinguish between Elasticsearch (the open source search server based on Lucene) and Elasticsearch (the company formed by the authors of the former) and it seems someone was listening, as the latter has now rebranded as simply Elastic. This was one of the big announcements during their first conference, the other being that after acquiring Norwegian company Found they are now offering a fully hosted Elasticsearch-as-a-service (congratulations to Alex and others at Found!). As Ben Kepes of Forbes writes, this may be something to do with ‘managing tensions within the ecosystem’ (I’ve written previously on how this ecosystem is expanding to include closed-source commercial products, which may make open source enthusiasts nervous) but it’s also an attempt to move away from ’search’ into a wider area encompassing the buzzwords-de-jour of Big Data Analytics.

In any case, it’s clear that Elastic (the company, and that’s hopefully the last time I’ll have to write this!) have a clear strategy for the future – to provide many different commercial options for Elasticsearch and its related projects for as many different use cases as possible. Of course, you can still take the open source route, which we’re helping several clients with at present – I hope to be able to present a case study on this very soon.

Meanwhile, Martin White has identified how a recent book on Elasticsearch describes literally hundreds of features and that ‘The skill lies in knowing which to implement given the nature of the content and the type of query that will be used’ – effective search, as ever, remains a difficult thing to get right, no matter what technology option you choose.

UPDATE: It seems that www.elasticsearch.org, the website for the open source project, is now redirecting to the commercial company website…there is now a new Github page for open source code at https://github.com/elastic

Tags: , , , ,

Posted in News

March 11th, 2015

1 Comment »

Lucene/Solr London User Group – Alfresco & Datastax

We had another London user group Meetup last week, hosted by Reed.co.uk who also provided some tasty pizza – eaten under the ‘Love Mondays’ sign from their adverts, which now lives in their boardroom! A few new faces this time and a couple of great talks from two companies who have incorporated Solr into their platforms.

First up was Andy Hind, a founding developer of document management company Alfresco, who told us all about how they originally based their search capability on Lucene 2.4, then moved to Solr 4.4 and most recently version 4.9.1. Using Solr they have implemented often complex security requirements (originally using a PostFilter as Erik Hatcher describes and more recently in the query itself), structured queries (using Phrase and SpanQueries) and their own domain specific query language (DSL) – they can support SQL-like, Lucene and Google-like queries by passing them through parsers based on ANTLR to be served either by the search engine or whatever relational database Alfresco is using. The move to a recent version of Solr has allowed the most recent release of Alfresco to support various modern search features (facets, spelling suggestions etc.) but Andy did mention that so far they are not using SolrCloud for scaling, preferring to manage this themselves.

Next up was Sergio Bossa of Datastax, talking about how their Datastax Enterprise (DSE) product incorporates Solr searching within an Apache Cassandra cluster. Sergio has previously spoken at our Cambridge search meetup on a very similar subject, so I won’t repeat myself here, but the key point is that Solr lives directly on top of the Cassandra cluster, so you don’t have to worry about it at all – search features are directly available from the Cassandra APIs. Like Alfresco, this is an alternative to SolrCloud (assuming you also need a NoSQL database of course!).

Thanks again to Alex Rice for hosting the Meetup, to both our speakers and to all who came – we’ll return soon! In the meantime you may want to check out a few events coming later this year: Berlin Buzzwords, ApacheCon Europe and Lucene/Solr Revolution.

Tags: , , , ,

Posted in Technical, events

February 16th, 2015

No Comments »

Out and about in January and February

We’re speaking at a couple of events soon: if you’re in London and interested in Apache Lucene/Solr we’re also planning another London User Group Meetup soon.

Firstly my colleague Alan Woodward is speaking with Martin Kleppman at FOSDEM in Brussels (31st January-1st February) on Searching over streams with Luwak and Apache Samza – about some fascinating work they’ve been doing to combine the powerful ‘reverse search’ facilities of our Luwak library with Apache Samza’s distributed, stream-based processing. We’re hoping this means we can scale Luwak beyond its current limits (although those limits are pretty accomodating, as we know of systems where a million or so stored searches are applied to a million incoming messages every day). If you’re interested in open source search the Devroom they’re speaking in has lots of other great talks planned.

Next I’m talking about the wider applications of this kind of reverse search in the area of media monitoring, and how open source software in general can help you turn your organisation’s infrastructure upside down, at the Intrateam conference event in Copenhagen from February 24th-26th. Scroll down to find my talk at 11.35 am on Thursday 26th.

If you’d like to meet us at either of these events do get in touch.

Comparing Solr and Elasticsearch – here’s the code we used

A couple of weeks ago we presented the initial results of a performance study between Apache Solr and Elasticsearch, carried out by my colleague Tom Mortimer. Over the last few years we’ve tested both engines for client projects and noticed some significant performance differences, which we thought deserved fuller investigation.

Although Flax is partnered with Solr-powered Lucidworks we remain completely independent and have no particular preference for either Solr or Elasticsearch – as Tom says in his slides they’re ‘both awesome’. We’re also not interested in scoring points for or against either engine or the various commercial companies that are support their development; we’re actively using both in client projects with great success. As it turned out, the results of the study showed that performance was broadly comparable, although Solr performed slightly better in filtered searches and seemed to support a much higher maximum queries per second.

We’d like to continue this work, but client projects will be taking a higher priority, so in the hope that others get involved both to verify our results and take the comparison further we’re sharing the code we used as open source. It would also be rather nice if this led to further performance tuning of both engines.

If you’re interested in other comparisons between Solr and Elasticsearch, here are some further links to try.

Do let us know you get on, what you discover and how we might do things better!

More than an API – the real third wave of search technology

I recently read a blog post by Karl Hampson of Realise Okana (who offer HP Autonomy and SRCH2 as closed source search options) on his view of the ‘third wave’ of search. The second wave he identifies (correctly) as open source, admitting somewhat grudgingly that “We’d heard about Lucene for years but no customers seemed to take it seriously until all of a sudden they did”. However, he also suggests that there is a third wave on its way – and this is led by HP with its IDOL OnDemand offering.

I’m afraid to say I think that IDOL OnDemand is in fact neither innovative or market leading – it’s simply an API to a cloud hosted search engine and some associated services. Amazon Cloudsearch (originally backed by Amazon’s own A9 search engine, but more recently based on Apache Solr) offers a very similar thing, as do many other companies including Found.no and Qbox with an Elasticsearch backend. For those with relatively simple search requirements and no issues with hosting their data with a third party, these services can be great value. It is however interesting to see the transition of Autonomy’s offering from a hugely expensive license fee (plus support) model to an on-demand cloud service: the HP acquisition and the subsequent legal troubles have certainly shaken things up! At a recent conference I heard a HP representative even suggest that IDOL OnDemand is ‘free software’ which sounds like a slightly desperate attempt to jump on the open source bandwagon and attract some hacker interest without actually giving anything away.

So if a third wave of search technology does exist, what might it actually be? One might suggest that companies such as Attivio or our partners Lucidworks, with their integrated solutions built on proven and scalable open source cores and folding in Hadoop and other Big Data stacks, are surfing pretty high at present. Others such as Elasticsearch (the company) are offering advanced analytical capabilities and easy scalability. We hear about indexes of billions of items, thousands of separate indexes : the scale of some of these systems is incredible and only economically possible where license fees aren’t a factor. Across our own clients we’re seeing searches across huge collections of complex biological data and monitoring systems handling a million new stories a day. Perhaps the third wave of search hasn’t yet arrived – we’re just seeing the second wave continue to flood in.

One interesting potential third wave is the use of search technology to handle even higher volumes of data (which we’re going to receive from the Internet of Things apparently) – classifying, categorising and tagging streams of machine-generated data. Companies such as Twitter and LinkedIn are already moving towards these new models – Unified Log Processing is a commonly used term. Take a look at a recent experiment in connecting our own Luwak stored query library to Apache Samza, developed at LinkedIn for stream processing applications.

Autumn events roundup – ESS DC, Solr vs Elasticsearch & a new Meetup

It’s looking like a busy Autumn for search events – first, I’m presenting at Enterprise Search & Discovery 2014 in Washington DC on November 5th, talking about ‘Turning Search Upside Down with open source software’. I’ll be describing how we’ve replaced various underperforming, big name closed source search engines with faster & more scalable open source technology, including our own Luwak stored query engine. Do let me know if you’re in DC, I’d be very happy to meet up. The week after this is Lucene Revolution, which sadly we won’t be attending this year, but it is recommended if you’re interested in Lucene and Solr.

Towards the end of November there’s Search Solutions, a great day of presentations about all aspects of search held at the British Computer Society in Covent Garden. This year Tom Mortimer from Flax will be presenting some research we’ve done into performance comparisons between Lucene/Solr and Elasticsearch, and there are also presentations from Thomson Reuters, the British Library, Microsoft, Yahoo! and Google. I highly recommend this event, it’s always worth attending.

We’re also starting a new Meetup in London, a group for users of Apache Lucene/Solr (there’s an Elasticsearch London user group but strangely no equivalent for the other popular stack). Our first event is on November 28th, kindly hosted by Bloomberg (who are no strangers to Lucene/Solr themselves) and featuring Shalin Mangar, a Lucene/Solr committer from Lucidworks who is visiting Europe that week. We’re hoping that we can run these events every few months, but we need help from the community, so if you could talk, sponsor or host the Meetups do let us know.

In December we’ll be holding another Cambridge Search Meetup and will be talking about our work with the European Bioinformatics Institute on the BioSolr project – the date to be confirmed. Busy times!

BioSolr begins with a workshop day

Last Thursday we attended a workshop day at the European Bioinformatics Institute as part of our joint BioSolr project. This was an opportunity for us to give some talks on particular aspects of Apache Lucene/Solr and hear from the various teams there on how they are using the software. The workshop was oversubscribed – it seems that there are even more people interested in Solr on the Wellcome Campus than we thought! We were also happy to welcome Giovanni Tummarello from Siren Solutions in Galway, Ireland and Lewis Geer from the EBI’s sister organisation in the USA, the NCBI.

We started with a brief introduction to BioSolr from Dr. Sameer Velankar and Flax then talked on Best Practices for Indexing with Solr. Based very much on our own experience and projects, we showed how although Solr’s Data Import Handler can be used to carry out many of the various tasks necessary to import, convert and process data, we prefer to write our own indexing systems, allowing us to more easily debug complex indexing tasks and protect the system from less stable external processing libraries. We then moved on to a presentation on Distributed Indexing, describing the older master/slaves technique and the more modern SolrCloud architecture we’ve used for several recent projects. We finished the morning’s talks with a quick guide to how to migrate from Apache Lucene to Apache Solr (which of course uses Lucene under the hood but is a much easier and full featured system to work with).

After lunch and some networking, we gave a further short presentation on comparing Elasticsearch to Solr, as some teams at the EBI have been considering its use. We then heard from Giovanni on Siren Solutions‘ innovative method for indexing heirarchical data with Solr using XML. His talk mentioned how by encoding tree positions directly within the index, far fewer Solr documents need to be created, with an index size reduction of 50% and up to twice the query speed. Siren have recently released open source plugins for both Solr and Elasticsearch based on this idea which are certainly worth investigating.

Following this talk, Lewis Geer described how the NCBI have built a large scale bioinformatics search platform backed both by Solr, built on commodity hardware and supporting up to 500 queries per second. To enable queries using various methods (Solr, SQL or even BLAST) they have built their own internal query language, standard result schemas and also collaborated with Heliosearch to develop improved JOIN facilities for Solr. The latter is a very exciting development as JOINs are heavily used in bioinformatics queries and we believe these features (made available recently as Solr patches) can be of use to the EBI as well. We’ll be investigating further how we can both use these features and help them to be committed to Solr.

Next were a collection of short talks from various teams from the Wellcome campus on how they were using Solr, Lucene and related tools. We heard from the PDBE, SPOT, Ensembl, UniProt, Sanger Core Services and Literature Services on a varied range of use cases, from searching proteins using Solr to scientific papers using Lucene. It was clear that we’ve still only scratched the surface of what is being done with both Lucene and Solr, and as the project progresses we hope to be able to generate repositories of useful software, documentation, best practises, guidance on migration and scaling and also learn a huge amount more about how search can be used in bioinformatics.

Over the next few weeks members of the Flax team will be visiting the EBI to work directly with the PDB and SPOT teams, to find out where we might be most effective. We’ll also be running Solr user group meetings at both the EBI and in Cambridge, of which more details soon. Do let us know if you’re interested! Thanks to the EBI for hosting the workshop day and of course the BBSRC for funding the BioSolr project.

Enterprise Search Europe 2014 day 1 – Decisions, research and a Meetup quiz

This year’s Enterprise Search Europe was held near Victoria train station in London and unfortunately coincided with a two day strike on the London Underground – worrying for the organisers, but apart from a few notable absences it didn’t seem to affect the attendance too much. We started with a keynote from Dale Roberts, whose book on Decision Sourcing inspired a talk about a ‘rational decision making model’. When examining traditional relational database applications Dale said ‘if you peer at it long enough you can see the rows and columns’ and his point was that modern consumer social networking applications don’t exhibit this old pattern – so this is where search application designers should look for inspiration. His co-presenter Rooven Pakkiri said that Enterprise Search should attempt to ‘release the information from inside our heads’, which of course social networking might help with, connecting you with colleagues. I’m not sure that one can easily take lessons learnt from consumer applications and apply them to business use, and some later speakers agreed with me, but this was a high-energy and thought-provoking start.

Next I chaired the Open Source track, where we started with Cedric Ulmer of France Labs, who talked about a search application they built for a consultancy business with around 40 employees. Using Apache Solr, Apache ManifoldCF and their own Datafari open source framework they turned this project around very quickly – interestingly, the end clients needed no training to use the new system, which implies a very well designed UI. Our second talk from Ronald Hobbs of Reed Business International described a project on a much larger scale: 100 million documents, 72 business units and up to 190 queries per second – this was originally served by the FAST ESP engine but they moved to an Apache Solr system, replacing the FAST processing pipeline with Search Technologies Aspire project. His five steps for an effective migration (Prepare, Get the right tools, Get the right team, Migrate in chunks, Clean up) I can only agree with from our own experience of such projects, including one from FAST ESP to Solr. I was amused by his description of the Apache Zookeeper project as ‘a bipolar manic depressive’, although it seemed this was eventually overcome with a successful deployment on Amazon EC2. Next was Galina Hinova of Intrafind on a aftersales search application for MAN Truck and Bus – again at serious scale (MAN have around 1 billion vehicles in existence with 100-150 documents related to each). Interestingly the Euro6 regulations for emissions and standardized EU terms for automobile parts were direct drivers of the project, with Apache Lucene as the base technology. No longer is open source search just for small-scale projects it seems!

After a short break during which I chatted to John Newton, founder of Documentum Alfresco, and his team we returned to hear Dan Jackson give a description of how UCL had improved their website search – with a chaotic mix of low quality content and an ‘awful’ content management system, the challenges were myriad but with the help of experts such as our associate Tony Russell-Rose they have made significant improvements. Next was what was to prove a very popular talk from Nick Brown of AstraZeneca on a huge, well funded project to build applications to support research and development – again, this was at large scale with 75 million documents (including ‘all the patents and all the research papers’). The key here was their creation of many well-targeted ‘apps’ to enable particular uses of the Sinequa search engine they chose for the back end, including mobile apps to help find others in the company (or external to it) who are also working on a particular drug or disease. This presentation showed just what can be achieved if companies really understand the potential of search technology – knowledge sharing and discovery of previously unknown information.

After a short drinks reception we retired to a nearby pub for the combined Cambridge and London Search Meetup – I’d prepared a short quiz (feel free to have a go!) which was won by Tony Russell-Rose’s team. Networking and chatting continued long into the evening, with some people from the wider UK search community also attending.

To be continued! You can see most of the slides here.

As Hadoop gains, does Lucene benefit?

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in a HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There’s also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!