Posts Tagged ‘scaling’

A new Meetup for Lucene & Solr

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers but among the 25 or so attending were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks own repository. It’s great to see effort being put into these kind of tests as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

BioSolr begins with a workshop day

Last Thursday we attended a workshop day at the European Bioinformatics Institute as part of our joint BioSolr project. This was an opportunity for us to give some talks on particular aspects of Apache Lucene/Solr and hear from the various teams there on how they are using the software. The workshop was oversubscribed – it seems that there are even more people interested in Solr on the Wellcome Campus than we thought! We were also happy to welcome Giovanni Tummarello from Siren Solutions in Galway, Ireland and Lewis Geer from the EBI’s sister organisation in the USA, the NCBI.

We started with a brief introduction to BioSolr from Dr. Sameer Velankar and Flax then talked on Best Practices for Indexing with Solr. Based very much on our own experience and projects, we showed how although Solr’s Data Import Handler can be used to carry out many of the various tasks necessary to import, convert and process data, we prefer to write our own indexing systems, allowing us to more easily debug complex indexing tasks and protect the system from less stable external processing libraries. We then moved on to a presentation on Distributed Indexing, describing the older master/slaves technique and the more modern SolrCloud architecture we’ve used for several recent projects. We finished the morning’s talks with a quick guide to how to migrate from Apache Lucene to Apache Solr (which of course uses Lucene under the hood but is a much easier and full featured system to work with).

After lunch and some networking, we gave a further short presentation on comparing Elasticsearch to Solr, as some teams at the EBI have been considering its use. We then heard from Giovanni on Siren Solutions‘ innovative method for indexing heirarchical data with Solr using XML. His talk mentioned how by encoding tree positions directly within the index, far fewer Solr documents need to be created, with an index size reduction of 50% and up to twice the query speed. Siren have recently released open source plugins for both Solr and Elasticsearch based on this idea which are certainly worth investigating.

Following this talk, Lewis Geer described how the NCBI have built a large scale bioinformatics search platform backed both by Solr, built on commodity hardware and supporting up to 500 queries per second. To enable queries using various methods (Solr, SQL or even BLAST) they have built their own internal query language, standard result schemas and also collaborated with Heliosearch to develop improved JOIN facilities for Solr. The latter is a very exciting development as JOINs are heavily used in bioinformatics queries and we believe these features (made available recently as Solr patches) can be of use to the EBI as well. We’ll be investigating further how we can both use these features and help them to be committed to Solr.

Next were a collection of short talks from various teams from the Wellcome campus on how they were using Solr, Lucene and related tools. We heard from the PDBE, SPOT, Ensembl, UniProt, Sanger Core Services and Literature Services on a varied range of use cases, from searching proteins using Solr to scientific papers using Lucene. It was clear that we’ve still only scratched the surface of what is being done with both Lucene and Solr, and as the project progresses we hope to be able to generate repositories of useful software, documentation, best practises, guidance on migration and scaling and also learn a huge amount more about how search can be used in bioinformatics.

Over the next few weeks members of the Flax team will be visiting the EBI to work directly with the PDB and SPOT teams, to find out where we might be most effective. We’ll also be running Solr user group meetings at both the EBI and in Cambridge, of which more details soon. Do let us know if you’re interested! Thanks to the EBI for hosting the workshop day and of course the BBSRC for funding the BioSolr project.

London Search Meetup – Serious Solr at Bloomberg & Elasticsearch 1.0

The financial information service Bloomberg hosted last Friday’s London Search Meetup in their offices on Finsbury Square – the venue had to be seen to be believed, furnished as it is with neon, chrome, modern art and fishtanks. A slight step up from the usual room above a pub! The first presenter was Ramkumar Aiyengar of Bloomberg on their new search system, accessed via the Bloomberg terminal (as it seems is everything else – Ramkumar even opened his presentation file and turned off notifications from his desk phone from within this application).

Make no mistake, Bloomberg’s requirements are significant: 900,000 new stories from 75,000 sources and 8 million manual searches every day with another 350,000 stored searches running automatically. Some of these stored searches are Boolean expressions with up to 20,000 characters and the source data is also enhanced with keywords from a list of over a million tags. Access Control Lists (ACLs) for security and over 40 languages are also supported, with new stories becoming searchable within 100ms. What is impressive is that these requirements are addressed using the open source Apache Lucene/Solr engine running 256 index shards, replicated 4 times for a total of 1024 cores, on a farm of 32 servers each with 256GB of RAM. It’s interesting to wonder if many closed source search engines could cope at all at this scale, and slightly scary to think how much it might cost!

Ramkumar explained how achieving this level of performance had led them to expose (and help to fix) quite a few previously unknown race conditions in Solr. His team had also found innovative ways to cope with such a large number of tags – each has a confidence value, say 70%, and this can be used to perform a kind of TF/IDF ranking by effectively adding 70 copies of the tag to a document. They have also developed an XML-based query parser for their in-house query syntax (althought in the future the JSON format may be used) and have contributed code back to Solr (for those interested, Bloomberg have contributed to SOLR-839 and are also looking at SOLR-4351).

For the monitoring requirement, we were very pleased to hear they are building an application based on our own Luwak stored query engine, which we developed for just this sort of high-performance application – we’ll be helping out where we can. Other future plans include relevance improvements, machine translation, entity search and connecting to some of the other huge search indexes running at Bloomberg, some on the petabyte scale.

Next up was Mark Harwood of Elasticsearch with an introduction to some of the features in version 1.0 and above. I’d been lucky enough to see Mark talk about some of these features a few weeks before so I won’t repeat myself here, but suffice it to say he again demonstrated the impressive new Aggregrations feature and raised the interesting possibility of market analysis by aggregating over a set of logged queries – identifying demand from what people are searching for.

Thanks to Bloomberg, Ramkumar, Mark and Tyler Tate for a fascinating evening – we also had a chance to remind attendees of the combined London & Cambridge Search Meetup on April 29th to coincide with the Enterprise Search Europe conference (note the discount code!).

Convergence and collisions in Enterprise Search

At the end of next month I’ll be at Enterprise Search Europe (I’m on the programme committee and help with the open source track) and the opening keynote this year is from Dale Roberts, author of the book Decision Sourcing. Dale will be talking about how Social, Big Data, Analytics and Enterprise Search are on a collision course and business leaders ignore these four themes at their peril.

So I wondered if we could see how in practical terms one might build systems based on these four themes. There are technical and logistical challenges of course (not least convincing someone to pay for the effort) but it’s worth exploring nonetheless.

Social in a business context can mean many things: social media is inherently noisy (and as far as I can see mostly cats) but when social tools are used within a business they can be a great way to encourage collaboration. We ourselves have added social features to search applications – user tagging of search results for example, to improve relevance for future searches and to help with de-duplication. Much has been made of the idea of finding not just relevant documents, but the subject matter experts that may have written them, or just other people in your organisation who are interested in the same subject. From a technical point of view none of this is particularly hard – you just have to add these social signals to your index and surface them in some intuitive way – but getting a high enough percentage of users to contribute to shared discussions and participate in tagging can be difficult.

Big Data is an overused term – but in a business context people usually apply it to very large collections of log files or other data showing how your customers are interacting with your business. A lot of search engine experts will tell you that Big Data isn’t always that ‘big’ – we’ve been dealing with collections of hundreds of millions or even billions of indexed items for many years now, the trick is scaling your solution appropriately (not just in technical terms, but in an economic way, as linearly as possible). If you’ve got a few million items, I’m sorry but you haven’t got Big Data, you’ve just got some data.

I’ve always been unsure of the benefits of search Analytics but I’m beginning to change my mind, having seen a some very impressive demos recently. Search engines have always counted things; the clever bit is allowing for queries that can surface unusual or interesting information, and using modern visualisation techniques to show this. Knowing the most popular search term may not be as important as spotting an unexpected one.

So we’ve indexed our data including tags, personnel records, internal chatrooms; put them all onto a elastically scalable platform and built some intuitive and useful interfaces to search and analyze our data. I’m pretty sure you could do all this with the open source technologies we have today (including Scrapy, Apache Lucene/Solr, Elasticsearch, Apache Hadoop, Redis, Logstash, Kibana, JQuery, Dropwizard, Python and Java). This isn’t the whole story though: you’d need a cross-disciplinary team within your organisation with the ability to gather user requirements and drive adoption, a suitable budget for prototyping, development and ongoing support and refinements to the system and a vision encompassing the benefits that it would bring your business. Not an inconsiderable challenge!

What questions should we be able to ask the system? I’ll leave that as an exercise for the reader.

See you in April! If you’d like a 20% discount on registration use the code HULL20. We’ll also be running an evening Meetup on Tuesday 29th April open to both conference attendees and others.

Principles of Solr application design – part 2 of 2

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! Here’s the second part, you can also read the first part.

8. Have enough RAM

The single biggest performance bottleneck in most search installations is lack of RAM. Search is an I/O-intensive process, and the more that disk reads can be cached in memory, the better performance will be. As a rough guideline, your available RAM should be at least 50% the total size of your Solr index files. For demanding applications, up to 100% of the index size may be necessary.

I/O caching is incremental rather than immediate, and some minutes of searches under load may be required to warm them. Don’t expect high performance until the caches are thoroughly warmed up.

An increasingly popular alternative is to use solid state disks (SSDs) instead of traditional hard disks. These are hundreds of times faster, and mean that cold searches should be reasonably fast. They also reduce the amount of RAM required to perhaps as little as 10% of the index size (although as always, this will require testing for the application in question).

9. Use a dedicated machine or VM

Don’t share your Solr servers with any other demanding processes such as SQL databases. For dependable performance, Solr should not have to compete with other processes for resources. VMs are an effective way of ring-fencing resources.

10. Use MMapDirectory and 64-bit systems

By default, Solr on 64-bit systems will open indexes with Lucene’s MMapDirectory, which memory-maps files rather opening them for read/write/seek. Don’t change this! MMapDirectory allows for the most effective use of resources, in particular RAM (which as already described is a crucial resource for search performance).

11. Tune the Solr caches

The OS disk cache improves performance at the low level. At the higher level, Solr has a number of built-in caches which are stored in the JVM heap, and which can improve performance still further. These include the filter cache, the field value cache, the query result cache and the document cache. The filter cache is probably the most important to tune if you are using filtered queries extensively or faceting with the enum method – each entry in the filter cache takes up ( number of docs on shard / 8 ) bytes of space, so if you’ve got a cache limit of 4,000 then you’ll require (numDocs * 500) bytes to hold all of them. However, tuning all of these caches has the potential to improve performance.

To tune the caches, you should allow Solr to run for a while with real or simulated search activity. Then go to the Plugin/Stats page in the admin web interface. The first important number in the cache statistics is ‘hitratio’. This should ideally be as close to 1.0 as possible, indicating that most lookups are being serviced by the cache. Then, ‘evictions’ indicates how many items have been removed from the cache due to limited space. This should ideally be as close to zero as possible, or at least much smaller than ‘lookups’.

If ‘evictions’ is high and ‘hitratio’ low, you should increase the maximum cache size in solrconfig.xml. It is impossible to say what a good starting point for a specific application is, but we often pick 4000.

If the cache is performing well, it may be worth reducing the maximum size and re-testing. The purpose of the maximum size is to prevent the cache growing without limit and filling the JVM heap, which links to point 12 below.

See here more information on Solr caches.

12. Minimise JVM heap space

Once you have tuned your Solr caches, try to reduce the maximum JVM heap (set with -Xmx) to a reasonably small size – big enough to hold the caches and all the other data required for searching and indexing, but not much bigger. There is a graphical depiction of the JVM heap in the Solr admin dashboard which allows a quick overview for rough tuning. For a better picture, it may be worth using a tool like JConsole to monitor the heap as the application is used.

The reason to reduce the heap size is to free RAM for the OS disk cache, as described in point 8.

Garbage collection (GC) can be a problem if the heap size is large. See here for information on GC tuning in Solr and other performance issues.

13. Handle multiple languages with multiple fields

Some search applications need to be able to support documents of different languages within the same index. This may conflict with the use of stemming, stopwords and synonyms to improve search accuracy. Furthermore, languages like Japanese are not tokenised by Solr in the same way as European languages, due to different conventions on word boundaries. One effective method for supporting mutiple languages in an index with per-language term processing is outlined as follows. Note that this depends on knowing in advance what language a section of text is in.

First, create a variant of each text field in the index schema for each language to be supported. The schema.xml supplied with Solr has example fieldtypes for a wide range of languages which may be adapted as necessary. For example:

˂field name="content_en" type="text_en" indexed="true" stored="true"/ ˃
˂field name="content_fr" type="text_fr" indexed="true" stored="true"/ ˃
˂field name="content_jp" type="text_jp" indexed="true" stored="true"/ ˃

Note the use of language codes to distinguish the names of the fields and fieldtypes. Then, when indexing each document, send each section of text to the appropriate field. E.g., if the document is entirely in English, send the whole thing to content_en. If it has sections in English, French and Japanese, send them to content_en, content_fr and content_jp respectively. This ensures that text is tokenised and normalised appropriately for its language.

Finally for searching, use the eDisMax query parser, and include all the language fields in the qf parameter (and pf, if using). E.g., in solrconfig.xml:

˂requestHandler name="/search" class="solr.SearchHandler"˃
˂lst name="defaults"˃
˂str name="qf"˃content_en content_fr content_jp˂/str˃
˂str name="pf"˃content_en content_fr content_jp˂/str˃
...

When a search is executed with this handler, subqueries will be generated for each language with the appropriate term processing, and searched against each language text field. This approach should give the best precision and recall in a multi-language application.

Tags: , , , , ,

Posted in Reference, Technical

December 17th, 2013

No Comments »

Lucene Revolution 2013, Dublin: day 2

A slow start to the day, possibly due to the aftereffects of the conference party the night before, but the stadium was still buzzing. I went to Rafal Kuć’s talk on SolrCloud which is becoming the standard way to build scalable Solr installations (we have two projects underway that use it). The shard splitting features in recent releases of Solr were interesting – previously one would either have to re-index the whole collection to a new set of shards, or more often over-allocate the number of shards to cope with a future increase in size, this method allows you to split an existing shard into two.

As our own talk was looming (and we needed to practise) I missed the next session unfortunately, although I hear from colleagues that the talk on High Performance JSON Search and Relational Faceted Browsing was good. We then broke for lunch during which we had a chance to test an idea Upayavira had come up with in the pub the night before: whether leeks are suitable for juggling, given that none of us had brought any proper equipment! They did work, but only briefly – luckily the stadium staff were very good natured about sweeping up the remains afterwards.

Our talk on Turning Search Upside Down: Using Lucene for Very Fast Stored Queries was next, during which I was ably assisted by Alan Woodward who has done the majority of the work during some recent projects for media monitoring companies. We’re hoping to release an open source library, Luwak, based on this work very soon – watch this space!

UPDATE: The video of our talk is now available and so is Luwak!

After an interesting talk next by Adrien Grand on What’s in a Lucene Index (unfortunately as this overran a little, we missed the closing remarks) it was time to say our goodbyes and head home. Thanks to all the Lucidworks team for organising a fascinating and friendly event – all of our team found it interesting and it was great to catch up with friends old and new. See you next time!

Tags: , , , , , ,

Posted in Uncategorized

November 8th, 2013

4 Comments »

Lucene Revolution 2013, Dublin: day 1

Four of the Flax team are in Dublin this week for Lucene Revolution, almost certainly the largest event centred on open source search and specifically Lucene. There are probably a couple of hundred Lucene enthusiasts here and the event is being held at the Aviva Stadium on Landsdowne Road: look out the windows and you can see the pitch! Here are some personal reflections: a number of the talks I attended today have a connection to our own work in media monitoring which we’re talking about tomorrow.

Doug Turnbull’s Test Driven Relevancy was interesting, discussing OSC’s Quepid tool that allows content owners and search experts to work together to tweak and tune Solr’s options to present the right results for a query. I wondered whether this tool might eventually be used to develop a Learning to Rank option for Solr, as Lucene 4 now supports a pluggable scoring model.

I enjoyed Real-Time Inverted Search in the Cloud Using Lucene and Storm during which Joshua Conlin told us about running hundreds of thousands of stored queries in a distrubuted architecture. Storm in particular sounds worth investigating further. There is currently no attempt to reduce or ‘prune’ the set of queries before applying them: Joshua quoted speeds of 4000 queries/sec across their cluster of 8 instances: impressive numbers, but our own monitoring applications are working at 20 times that speed by working out which queries not to apply.

I broke out at this point to catch up with some contacts, including the redoubtable Iain Fletcher of Search Technologies – always a pleasure. After a sandwich lunch I went along to hear Andrzej Bialecki of Lucidworks talk about Sidecar Indexes, a method for allowing rapid updates to Lucene fields. This reminded me of our own experiments in this area using Lucene’s pluggable codecs.

Next was more from the Opensource Connections team, as John Berryman talked about their work to update a patent search application that uses a very old search syntax, BRS. This sounds very much the work we’ve done to translate one search engine syntax into another for various media monitoring companies – so far we can handle dtSearch and we’re currently finishing off support for HP/Autonomy Verity’s VQL (PDF).

This latter issue has got me thinking that perhaps it might be possible to collaboratively develop an open source search engine query language – various parsers could be developed to turn other search syntaxes into this language, and search engines like Lucene (or anything else) could then be extended to implement support for it. This would potentially allow much easier migration between search engine technologies. I’m discussing the concept with various folks at the event this week so do please get in touch if you are interested!

Back tomorrow with a further update on this exciting conference – tonight we’re all off to the Temple Bar area of Dublin for food and drink, generously provided by Lucidworks who should also be thanked for organising the Revolution.

Tags: , , , , , ,

Posted in Technical, events

November 6th, 2013

3 Comments »

Elasticsearch meetup – Duedil, Hadoop and more

I visited the London Elasticsearch User Groupsmeetup last night for the first time, in the rather splendid HQ of Skills Matter just down from Old Street – the venue had a great buzz. The first speaker was Chris Simpson from Duedil who provide UK company information gleaned from Companies House and other sources. He told us about using Elasticsearch to provide faceted search (including some great clickable bar graphs for numerical range facets) and how they bulk index around 9 million company records in about an hour, using Elasticsearch’s alias features to swap in new indexes once they’re ready – so there is no impact on search performance while indexing. He mentioned a common problem with search engines, which is there is no easy way to be sure how much hardware you’ll need until you ‘know your data and know your hosts’.

Next up was Chris Harris from Hortonworks, who provide a packaged and supported Apache Hadoop distribution. He explained how Hadoop can be used for capturing huge numbers of transactions (these could be interactions with an e-commerce website for example) and for storing them in a distributed database on low-cost hardware. The Hive ‘SQL-like’ language can then be used to extract the data and send it directly to Elasticsearch, or indeed to run queries on Elasticsearch and send the results back to Hadoop as a table. Similar processes can be run with the Pig scripting language. There followed some interesting discussions about the future of Hadoop, where search engines such as Elasticsearch may run directly on Hadoop nodes, working with the data locally. It will be interesting to compare this with the approach taken by Cloudera who are talking on Hadoop & Solr this Thursday at our own Meetup in Cambridge.

Clinton Gormley from Elasticsearch finished up with a Q&A, during which he talked about the new Phrase Suggesters based on Lucene’s new Finite State Machines, and gave hints about when the long awaited 1.0 release of Elasticsearch will appear – apparently early 2014 is now likely.

Thanks to all the speakers and to Elasticsearch for the very welcome beer and pizza – this certainly won’t be our last visit to this user group on what is an increasingly adopted open source search engine.

An open approach to tuning search for gov.uk

Roo Reynolds from the GDS team has written a great blog post about the ongoing process of tuning the search for gov.uk which I can highly recommend.

We regularly see situations where a search project has been set up as ‘fire and forget’ – which is never a good idea: not only does content grow, but user needs change and search requirements evolve, whatever the application. Search should be a living project: monitoring user behaviour should reveal not just which searches ‘work’ (i.e. the user gets some results which they then click on) but more important which ones don’t. For example, common mispellings or acronyms might be a useful addition to a synonym list; if average search response times are lengthening then it might be time to consider performance tuning or even scaling out; the constant use of the ‘Next 10 Results’ button might indicate a problem with relevance ranking.

Luckily any improvements to gov.uk made by the GDS team should appear in their Github repository at some point – as I mentioned before the GDS team are (very sensibly) committed to an open source approach.

Tags: , , ,

Posted in Reference, Technical

June 12th, 2013

No Comments »

Phony wars: the battle between Solr and Elasticsearch

The most well known open source search engine, Apache Lucene/Solr, has a rival in Elasticsearch, also based on Apache Lucene. Or maybe it doesn’t. I’m not convinced that there’s an actual battle going on here, above and beyond the fact that the commercial companies formed to support each technology (Lucidworks and Elasticsearch [the company]) are obviously competitors. Let’s look at the evidence:

  • Elasticsearch contains (by some measures) 64 years of effort, Solr only 55 years….a point to Elasticsearch!
  • Elasticsearch commits are 31% down on last year, Solr commits are 85% up…a point to Solr!
  • There are more books about Solr than Elasticsearch…a point to Solr!
  • Elasticsearch, sorry elasticsearch, has a cool lower case logo and fancy website…a point to Elasticsearch!

This is of course before we get to any actual technical differences in terms of performance, scalability, ease-of-use etc. which are probably a lot more important than the list above. There are vocal critics and supporters of each project on Twitter and other media, but the great thing in our view is that there is a choice of two such excellent search technologies, both open source, so for real world applications one can try both at little cost and choose whichever is most appropriate (there are even proven migration routes between the two – we’ve helped one client with this process).

Tags: , , , ,

Posted in Business, Technical

January 14th, 2013

3 Comments »