London Search Meetup – Serious Solr at Bloomberg & Elasticsearch 1.0

The financial information service Bloomberg hosted last Friday’s London Search Meetup in their offices on Finsbury Square – the venue had to be seen to be believed, furnished as it is with neon, chrome, modern art and fishtanks. A slight step up from the usual room above a pub! The first presenter was Ramkumar Aiyengar of Bloomberg, on their new search system, accessed via the Bloomberg terminal (as, it seems, is everything else – Ramkumar even opened his presentation file and turned off his desk phone’s notifications from within this application).

Make no mistake, Bloomberg’s requirements are significant: 900,000 new stories from 75,000 sources and 8 million manual searches every day, with another 350,000 stored searches running automatically. Some of these stored searches are Boolean expressions with up to 20,000 characters, and the source data is also enhanced with keywords from a list of over a million tags. Access Control Lists (ACLs) for security and over 40 languages are also supported, with new stories becoming searchable within 100ms. What is impressive is that these requirements are addressed using the open source Apache Lucene/Solr engine running 256 index shards, replicated 4 times for a total of 1024 cores, on a farm of 32 servers each with 256GB of RAM. It’s interesting to wonder whether many closed source search engines could cope at this scale at all, and slightly scary to think how much it might cost!

Ramkumar explained how achieving this level of performance had led them to expose (and help to fix) quite a few previously unknown race conditions in Solr. His team had also found innovative ways to cope with such a large number of tags – each has a confidence value, say 70%, and this can be used to perform a kind of TF/IDF ranking by effectively adding 70 copies of the tag to a document. They have also developed an XML-based query parser for their in-house query syntax (although the JSON format may be used in future) and have contributed code back to Solr (for those interested, Bloomberg have contributed to SOLR-839 and are also looking at SOLR-4351).
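As a rough illustration of the tag-weighting trick (the field and tag names here are invented, not Bloomberg’s schema), repeating a tag in proportion to its confidence simply inflates its term frequency at index time:

import json

def weighted_tags(tag, confidence):
    # Repeat the tag so that its term frequency reflects how strongly it applies:
    # a confidence of 0.7 becomes 70 copies of the term.
    return [tag] * int(round(confidence * 100))

doc = {
    "id": "story-123",
    "tags": weighted_tags("central-banks", 0.7) + weighted_tags("inflation", 0.4),
}
print(json.dumps(doc, indent=2))  # this document would then be indexed as normal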

For the monitoring requirement, we were very pleased to hear they are building an application based on our own Luwak stored query engine, which we developed for just this sort of high-performance application – we’ll be helping out where we can. Other future plans include relevance improvements, machine translation, entity search and connecting to some of the other huge search indexes running at Bloomberg, some on the petabyte scale.

Next up was Mark Harwood of Elasticsearch with an introduction to some of the features in version 1.0 and above. I’d been lucky enough to see Mark talk about some of these features a few weeks before so I won’t repeat myself here, but suffice it to say he again demonstrated the impressive new Aggregations feature and raised the interesting possibility of market analysis by aggregating over a set of logged queries – identifying demand from what people are searching for.

Thanks to Bloomberg, Ramkumar, Mark and Tyler Tate for a fascinating evening – we also had a chance to remind attendees of the combined London & Cambridge Search Meetup on April 29th to coincide with the Enterprise Search Europe conference (note the discount code!).

Cambridge Search Meetup – six degrees of ontology and Elasticsearching products

Last Wednesday evening the Cambridge Search Meetup was held, with two very different talks – we started with Zoë Rose, an information architect who has lent her expertise to Proquest, the BBC and now the UK Government. She gave an engaging talk on ontologies, showing how they can be useful for describing things that don’t easily fit into traditional taxonomies and how they can even be used for connecting Emperor Hirohito of Japan to Kevin Bacon in less than six steps. Along the way we learnt about sea creatures that lose their spines, Zoë’s very Australian dislike of jellyfish and other stinging sea dwellers, and her own self-cleaning fish tank at home.

As search developers, we’re often asked to work with both taxonomies and ontologies, and the challenge is how to represent them in a flat, document-focused index – perhaps ontologies are better represented by linked data stores such as those provided by Apache Marmotta.

Next was Jurgen Van Gael of Rangespan, a company that provides an easy way for retailers to expand their online inventory beyond what is available in brick-and-mortar stores (customers include Tesco, Argos and Staples). Jurgen described how product data is gathered into MongoDB and MySQL databases, processed and cleaned up on an Apache Hadoop cluster and finally indexed using Elasticsearch to provide a search application for Rangespan’s customers. Although indexing of 50 million items takes only 75 minutes, most of the source data is only updated daily. Jurgen described how hierarchical facets are available and also how users may create ‘shortlists’ of products they are interested in – these are stored directly in Elasticsearch itself, acting as a simple NoSQL database. For me one of the interesting points from his talk was why Elasticsearch was chosen as a technology – it was tried during a hack day, found to be very easy to scale and to get started with, and quickly became a significant part of their technology stack. Some years ago implementing this kind of product search over tens of millions of items would have been a significant challenge (not to mention very expensive) – with Elasticsearch and other open source software this is simply no longer the case.

Networking and socialising continued into the evening, with live music in the pub downstairs. Thanks to everyone who came and in particular our two speakers. We’ll be back with another Meetup soon!

Convergence and collisions in Enterprise Search

At the end of next month I’ll be at Enterprise Search Europe (I’m on the programme committee and help with the open source track) and the opening keynote this year is from Dale Roberts, author of the book Decision Sourcing. Dale will be talking about how Social, Big Data, Analytics and Enterprise Search are on a collision course and business leaders ignore these four themes at their peril.

So I wondered how, in practical terms, one might build systems based on these four themes. There are technical and logistical challenges of course (not least convincing someone to pay for the effort) but it’s worth exploring nonetheless.

Social in a business context can mean many things: social media is inherently noisy (and as far as I can see mostly cats) but when social tools are used within a business they can be a great way to encourage collaboration. We ourselves have added social features to search applications – user tagging of search results for example, to improve relevance for future searches and to help with de-duplication. Much has been made of the idea of finding not just relevant documents, but the subject matter experts that may have written them, or just other people in your organisation who are interested in the same subject. From a technical point of view none of this is particularly hard – you just have to add these social signals to your index and surface them in some intuitive way – but getting a high enough percentage of users to contribute to shared discussions and participate in tagging can be difficult.

Big Data is an overused term – but in a business context people usually apply it to very large collections of log files or other data showing how your customers are interacting with your business. A lot of search engine experts will tell you that Big Data isn’t always that ‘big’ – we’ve been dealing with collections of hundreds of millions or even billions of indexed items for many years now; the trick is scaling your solution appropriately (not just in technical terms, but in an economic way, as linearly as possible). If you’ve got a few million items, I’m sorry but you haven’t got Big Data, you’ve just got some data.

I’ve always been unsure of the benefits of search Analytics, but I’m beginning to change my mind, having seen some very impressive demos recently. Search engines have always counted things; the clever bit is allowing for queries that can surface unusual or interesting information, and using modern visualisation techniques to show this. Knowing the most popular search term may not be as important as spotting an unexpected one.

So we’ve indexed our data, including tags, personnel records and internal chatrooms; put it all onto an elastically scalable platform; and built some intuitive and useful interfaces to search and analyse our data. I’m pretty sure you could do all this with the open source technologies we have today (including Scrapy, Apache Lucene/Solr, Elasticsearch, Apache Hadoop, Redis, Logstash, Kibana, jQuery, Dropwizard, Python and Java). This isn’t the whole story though: you’d need a cross-disciplinary team within your organisation with the ability to gather user requirements and drive adoption, a suitable budget for prototyping, development, ongoing support and refinements to the system, and a vision encompassing the benefits that it would bring your business. Not an inconsiderable challenge!

What questions should we be able to ask the system? I’ll leave that as an exercise for the reader.

See you in April! If you’d like a 20% discount on registration use the code HULL20. We’ll also be running an evening Meetup on Tuesday 29th April open to both conference attendees and others.

ElasticSearch London Meetup – a busy and interesting evening!

I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest there is in ElasticSearch these days and the event didn’t disappoint, with some fascinating talks.

Hugo Pickford-Wardle from Rely Consultancy kicked off with a discussion about how ElasticSearch allows for rapid ‘hard prototyping’ – a way to very quickly test the feasibility of a business idea, and/or to demonstrate previously impossible functionality using open source software. His talk focussed on how a search engine can help to surface content from previously unconnected and inaccessible ‘data islands’, promote re-use and repurposing of the data, and lead clients to understand the value of committing to funding further development. Examples included a new search over planning applications for Westminster City Council. Interestingly, Hugo mentioned that during one project ElasticSearch was found to be 10 times faster than the closed source (and very expensive) Autonomy IDOL search engine.

Next was Indy Tharmakumar from our hosts Goldman Sachs, showing how his team have built powerful support systems using ElasticSearch to index log data. Using 32 single-core CPU instances, the system they have built can store 1.2 billion log lines with a throughput of up to 40,000 messages a second (the systems monitored produce 5TB of log data every day). Log data is queued up in Redis, distributed to many Logstash processes and indexed by Elasticsearch, with a Kibana front end. They learned that Logstash can be particularly CPU intensive but Elasticsearch itself scales extremely well. Future plans include considering Apache Kafka as a data backbone.
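To give a flavour of that pipeline, the shipper at the front of it can be as simple as the following sketch, which pushes JSON events onto a Redis list for Logstash’s redis input to consume (the host names, list key and event fields are assumptions, not details of the Goldman Sachs system):

import json
import socket
import time

import redis  # the redis-py client

r = redis.Redis(host="localhost", port=6379)

def ship(line):
    # Wrap the raw log line in a minimal Logstash-style event.
    event = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "host": socket.gethostname(),
        "message": line.rstrip("\n"),
    }
    # Push onto a list; Logstash would be configured with a redis input reading
    # the same key, followed by an elasticsearch output.
    r.rpush("logstash", json.dumps(event))

with open("/var/log/app.log") as f:
    for line in f:
        ship(line)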

The third presentation was by Clinton Gormley of ElasticSearch, talking about the new cross field matching features that allow term frequencies to be summed across several fields, preventing certain cases where traditional matching techniques based on Lucene’s TF/IDF ranking model can produce some unexpected behaviour. Most interesting for me was seeing Marvel, a new product from ElasticSearch (the company), containing the Sense developer console allowing for on-the-fly experimentation. I believe this started as a Chrome plugin.
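For those who haven’t seen the cross-field matching feature, a query using it looks something like the sketch below (the index and field names are invented for illustration); cross_fields blends term statistics across the listed fields rather than scoring each field independently:

import requests

query = {
    "query": {
        "multi_match": {
            "query": "peter smith",
            "type": "cross_fields",              # sum term frequencies across fields
            "fields": ["first_name", "last_name"],
            "operator": "and",
        }
    }
}

# Assumes a local Elasticsearch node with a hypothetical "people" index.
response = requests.post("http://localhost:9200/people/_search", json=query)
print(response.json()["hits"]["total"])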

The last talk, by Mark Harwood, again from ElasticSearch, was the most interesting for me. Mark demonstrated how to use a new feature (planned for the 1.1 release, or possibly later): an aggregation for significant terms. This allows one to spot anomalies in a data set – ‘uncommon common’ occurrences, as Mark described it. His prototype showed a way to visualise UK crime data using Google Earth, identifying areas of the country where certain crimes are most reported – examples included bike theft here in Cambridge (which we’re sadly aware of!). Mark’s Twitter account has some further information and pictures. This kind of technique allows for very powerful analytics capabilities to be built using Elasticsearch to spot anomalies such as compromised credit cards, and to use visualisation to further identify the guilty party, for example a hacked online merchant. As Mark said, it’s important to remember that the underlying Lucene search library counts everything – and we can use those counts in some very interesting ways.
UPDATE: Mark has posted some code from his demo here.
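For readers who want to experiment, a significant terms query has roughly the shape sketched below (this is not Mark’s demo code; the index and field names are invented, loosely following the crime data example, and the aggregation is only available from the release described above):

import requests

body = {
    "query": {"match": {"force": "Cambridgeshire Constabulary"}},
    "aggregations": {
        "unusually_common_crimes": {
            # Find terms that are unusually frequent in this result set
            # compared with the index as a whole.
            "significant_terms": {"field": "crime_type"}
        }
    },
    "size": 0,  # we only want the aggregation, not the matching documents
}

response = requests.post("http://localhost:9200/crimes/_search", json=body)
for bucket in response.json()["aggregations"]["unusually_common_crimes"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])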

The evening closed with networking, pizza and beer with a great view over the City – thanks to Yann Cluchey for organising the event. We have our own Cambridge Search Meetup next week and we’re also featuring ElasticSearch, as does the London Search Meetup a few weeks later – hope to see you there!

How we built a search engine for UK MP tweets with Solr, Python & StanfordNLP

Matt Pearce writes:

We recently released UKMP, a search application built on work done on last year’s Enterprise Search hack day. This presents the tweets of UK Members of Parliament with search options including filtering by party, retweet and favourite count, and entities (people, locations and organisations) extracted from the tweet text. This is obviously its first incarnation, so there are still a number of features in development, but I thought I would comment on some of the decisions taken while developing the site.

I started off by deciding which bits of the hack day code would be most useful, from both the Solr set-up side and the web application we were hoping to build. During the hack day, the group had split into a number of smaller teams, with two of them working on a set of data downloaded from Twitter, containing the original set of UK MP tweets. I took the basic Solr setup and indexing code from one group, and the initial web application from the other.

Obviously we couldn’t work with a completely static data set, so I set about putting together a Python script to grab the tweets. This was where I met the first hurdle: I was trying to grab tweets from individual MPs’ feeds, but kept getting blocked by the Twitter API, even though I didn’t think I was over-stepping the limits set on the calls. With 200-plus MPs to track, a different approach was needed to avoid being blocked, so eventually I switched to using the lists compiled by Tweetminster, who track politicians’ tweets themselves. This worked much better, and I could soon start building a useful data set.
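As a rough sketch of that approach (the credentials, list owner and slug below are placeholders, and this assumes the Twitter REST API v1.1 lists/statuses endpoint that was current at the time):

import requests
from requests_oauthlib import OAuth1

# Placeholder credentials – register an application with Twitter to obtain real ones.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

def fetch_list_tweets(owner="tweetminster", slug="uk-mps", since_id=None):
    # Pull the most recent tweets from a public Twitter list in one call,
    # instead of hitting each MP's timeline individually.
    params = {"owner_screen_name": owner, "slug": slug, "count": 200}
    if since_id:
        params["since_id"] = since_id
    resp = requests.get("https://api.twitter.com/1.1/lists/statuses.json",
                        auth=auth, params=params)
    resp.raise_for_status()
    return resp.json()

tweets = fetch_list_tweets()
print(len(tweets), "tweets fetched")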

I chose the second group’s web application because it already used the Stanford NLP software to extract entities from the tweet text. The indexer script, also written in Python, calls the web app to extract the entities before indexing the tweets. We spent some time trying to incorporate the Stanford sentiment analysis as well, but found it wasn’t practical – the response time was too slow, and we didn’t have time to train it on our own data to provide a more useful analysis of the content (almost all tweets were rated as either “negative” or “neutral”, which didn’t accurately reflect the sentiments in the data).
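The indexing step looks roughly like the sketch below; the entity-extraction endpoint, Solr core name and field names are placeholders rather than the real UKMP configuration:

import requests

NER_URL = "http://localhost:8080/extract"                 # hypothetical entity-extraction endpoint
SOLR_UPDATE = "http://localhost:8983/solr/ukmp/update"    # core name is a guess

def index_tweet(tweet):
    # Ask the web application (which wraps Stanford NER) for entities in the tweet text.
    entities = requests.post(NER_URL, data={"text": tweet["text"]}).json()
    doc = {
        "id": tweet["id_str"],
        "text": tweet["text"],
        "retweet_count": tweet.get("retweet_count", 0),
        "favourite_count": tweet.get("favorite_count", 0),
        "people": entities.get("people", []),
        "locations": entities.get("locations", []),
        "organisations": entities.get("organisations", []),
    }
    # Solr accepts a JSON array of documents on its update handler.
    requests.post(SOLR_UPDATE, json=[doc],
                  params={"commit": "true"}).raise_for_status()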

Since this was an entirely new project, and because it was being done outside the main client workflow, I took the opportunity to try out AngularJS, an MVC-oriented JavaScript front-end framework. This runs on top of, and calls back to, the DropWizard web application, which provides the Model part of the Model-View-Controller system. AngularJS itself provides the Controller, while the Views are all written in fairly standard HTML, with some AngularJS frosting to fill in the content.

AngularJS itself generally made development very easy and fast, and I was pleased by how little JavaScript I had to write to build a working application (there is also a Bootstrap crossover module, providing AngularJS directives to work with the UI layout tools Bootstrap provides). As a small site, there are only two controllers in play: one for each page. AngularJS also makes it very easy to plug in other script modules, such as the one used to generate the word cloud on the About page. However, I did come across a few sticking points as I built the app, as one might expect from a first-time user. The principal one was handling the search box at the top of the page, which had to be independent of the view while needing to modify it to display the search results. I am still not convinced I ended up with the best approach: the search form fires an event when submitted, which percolates up the AngularJS control hierarchy until it is caught and dealt with – within the search page, the search is handled normally; from other pages, we redirect to the search page and pass in the term. It works, but it doesn’t feel as smooth as it should.

All in all, this was an interesting sideline project, and provided a good excuse to try out some new technology. The code itself, along with some notes on how to get the system up and running, is in our github repository – feel free to try it out, and make suggestions for improvements or better ways to use the code.

The closed-source topping on the open-source Elasticsearch

Today Elasticsearch (the company, not the software) announced their first commercial, closed-source product, a monitoring plugin for Elasticsearch (the software, not the company – yes, I know this is confusing; one might suspect deliberately so). Amongst the raft of press releases there are a few small liberties with the truth, for example describing Elasticsearch (the company) as ‘founded in 2012 by the people behind the Elasticsearch and Apache Lucene open source projects’ – surely the latter project was started by Doug Cutting, who isn’t part of the aforementioned company.

Adding some closed-source dusting to a popular open-source distribution is nothing new of course – many companies do it, especially those that are venture funded – it’s a way of building intellectual property while also taking full advantage of the open-source model in terms of user adoption. Other strategies include curated distributions such as that offered by Heliosearch, founded by Solr creator Yonik Seeley, and the complete packaged search applications offered by our partner LucidWorks. It can help lock potential clients into your version of the software and your vision of the future, although of course they are still free to download the core and go it alone (or engage people like us to help do so), which helps them retain some control.

It’s going to be interesting to see how this strategy develops for Elasticsearch (for the last time, the company). At Flax we’ve also built various additional software components for search applications – but as we have no external investors to please, these are freely available as open-source software, including Luwak, our fast stored query engine; Clade, a taxonomy/classification prototype; and even some file format extractors.

Search events for 2014

Here’s a few search-related events over the next few months for your consideration:

  • On Tuesday 1st April the International Society for Knowledge Organisation are holding a seminar in London on ‘Taming the News Beast’ with contributions from the BBC and Press Association amongst others. We’ll be attending, as many of our clients are from the news sector.
  • On 29th-30th April (with workshops the day before) we have Enterprise Search Europe, now at a new (slightly more central) London venue and with presentations from Ernst & Young, Reed Elsevier, MAN Truck & Bus, AstraZeneca and the University of London – do take a look at the really strong programme this year. On the Monday I’ll be repeating my workshop on Getting the Best from Open Source Search for those interested in planning and/or implementing an open source search application. I’m very pleased to be able to offer a 20% discount on registration fees – just use the code HULL20 when you apply.
  • Berlin Buzzwords is held on May 25th-28th with the usual mix of talks on Search, Store and Scale – this is always a popular event and we expect someone from Flax will attend.
  • I’ll post up more events as they are announced – we’re also hoping to hold another Cambridge Search Meetup soon. Do let me know if you’d like to meet up at any of the events above!


Time for the crystal ball again…

It’s always fun to make predictions about the future, especially as one can be pretty sure to be proved wrong in interesting ways. At the start of 2014 we at Flax are looking forward to another year of building open source search, and we already have some great client projects in progress that we’ll shortly be able to talk about – but what else might be happening this year? Here are some points to note:

  • The Elasticsearch project continues to add features at a prodigious rate during the arms race between it and Apache Solr – this battle can only be good news for end users in our view. We can expect a 1.0 release of Elasticsearch this year and several further major 4.x releases of Solr.
  • The Solr world has become slightly more complex as original author Yonik Seeley has left Lucidworks to start his own company, Heliosearch – with its own packaged distribution of Solr. How will Heliosearch contribute to the Solr ecosystem?
  • HP Autonomy is a sponsor of the Enterprise Search Europe conference this year, although there’s still some fallout from HP’s acquisition of Autonomy, and little news from the various official investigations into this process. Perhaps this year HP’s overall strategy will become a little clearer.
  • The Big Data bandwagon rolls on and more or less every search company now stresses its capabilities in this area for marketing purposes: but how big is Big? It’s not enough just to re-quote IDC’s latest study on how many exabytes everyone is producing these days; the value is in the detail, not the sheer volume: good (and deep) analytics is the key.
  • We think there might be some interesting things happening around open source search and bioinformatics soon – watch this space!


Principles of Solr application design – part 2 of 2

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! Here’s the second part, you can also read the first part.

8. Have enough RAM

The single biggest performance bottleneck in most search installations is lack of RAM. Search is an I/O-intensive process, and the more that disk reads can be cached in memory, the better performance will be. As a rough guideline, your available RAM should be at least 50% of the total size of your Solr index files. For demanding applications, up to 100% of the index size may be necessary.

I/O caching is incremental rather than immediate, and some minutes of searching under load may be required to warm the caches. Don’t expect high performance until the caches are thoroughly warmed up.

An increasingly popular alternative is to use solid state disks (SSDs) instead of traditional hard disks. These are hundreds of times faster, and mean that cold searches should be reasonably fast. They also reduce the amount of RAM required to perhaps as little as 10% of the index size (although as always, this will require testing for the application in question).

9. Use a dedicated machine or VM

Don’t share your Solr servers with any other demanding processes such as SQL databases. For dependable performance, Solr should not have to compete with other processes for resources. VMs are an effective way of ring-fencing resources.

10. Use MMapDirectory and 64-bit systems

By default, Solr on 64-bit systems will open indexes with Lucene’s MMapDirectory, which memory-maps files rather than opening them for read/write/seek. Don’t change this! MMapDirectory allows for the most effective use of resources, in particular RAM (which, as already described, is a crucial resource for search performance).

11. Tune the Solr caches

The OS disk cache improves performance at the low level. At the higher level, Solr has a number of built-in caches which are stored in the JVM heap, and which can improve performance still further. These include the filter cache, the field value cache, the query result cache and the document cache. The filter cache is probably the most important to tune if you are using filtered queries extensively or faceting with the enum method – each entry in the filter cache takes up (number of docs on shard / 8) bytes of space, so with a cache limit of 4,000 entries you will need (numDocs * 500) bytes to hold all of them. For example, a shard of 10 million documents means roughly 1.25MB per cache entry, or around 5GB for a full cache. However, tuning all of these caches has the potential to improve performance.

To tune the caches, you should allow Solr to run for a while with real or simulated search activity. Then go to the Plugin/Stats page in the admin web interface. The first important number in the cache statistics is ‘hitratio’. This should ideally be as close to 1.0 as possible, indicating that most lookups are being serviced by the cache. Then, ‘evictions’ indicates how many items have been removed from the cache due to limited space. This should ideally be as close to zero as possible, or at least much smaller than ‘lookups’.

If ‘evictions’ is high and ‘hitratio’ low, you should increase the maximum cache size in solrconfig.xml. It is impossible to say what a good starting point for a specific application is, but we often pick 4000.
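For reference, the filter cache is configured in solrconfig.xml with an entry along these lines; the sizes shown are only a starting point and should be adjusted in the light of the statistics described above:

<filterCache class="solr.FastLRUCache"
             size="4000"
             initialSize="512"
             autowarmCount="128"/>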

If the cache is performing well, it may be worth reducing the maximum size and re-testing. The purpose of the maximum size is to prevent the cache growing without limit and filling the JVM heap, which links to point 12 below.

See here for more information on Solr caches.

12. Minimise JVM heap space

Once you have tuned your Solr caches, try to reduce the maximum JVM heap (set with -Xmx) to a reasonably small size – big enough to hold the caches and all the other data required for searching and indexing, but not much bigger. There is a graphical depiction of the JVM heap in the Solr admin dashboard which allows a quick overview for rough tuning. For a better picture, it may be worth using a tool like JConsole to monitor the heap as the application is used.

The reason to reduce the heap size is to free RAM for the OS disk cache, as described in point 8.

Garbage collection (GC) can be a problem if the heap size is large. See here for information on GC tuning in Solr and other performance issues.

13. Handle multiple languages with multiple fields

Some search applications need to be able to support documents of different languages within the same index. This may conflict with the use of stemming, stopwords and synonyms to improve search accuracy. Furthermore, languages like Japanese are not tokenised by Solr in the same way as European languages, due to different conventions on word boundaries. One effective method for supporting multiple languages in an index with per-language term processing is outlined below. Note that this depends on knowing in advance what language a section of text is in.

First, create a variant of each text field in the index schema for each language to be supported. The schema.xml supplied with Solr has example fieldtypes for a wide range of languages which may be adapted as necessary. For example:

<field name="content_en" type="text_en" indexed="true" stored="true"/>
<field name="content_fr" type="text_fr" indexed="true" stored="true"/>
<field name="content_jp" type="text_jp" indexed="true" stored="true"/>

Note the use of language codes to distinguish the names of the fields and fieldtypes. Then, when indexing each document, send each section of text to the appropriate field. E.g., if the document is entirely in English, send the whole thing to content_en. If it has sections in English, French and Japanese, send them to content_en, content_fr and content_jp respectively. This ensures that text is tokenised and normalised appropriately for its language.
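In practice an indexed document might look something like this minimal sketch (the core name and update handler URL are assumptions; fields for languages a document doesn’t contain are simply omitted):

import requests

doc = {
    "id": "doc-42",
    "content_en": "The quick brown fox jumps over the lazy dog",
    "content_fr": "Portez ce vieux whisky au juge blond qui fume",
}

# Post to a hypothetical local Solr core; each field is analysed by its own fieldtype.
requests.post("http://localhost:8983/solr/collection1/update",
              json=[doc], params={"commit": "true"}).raise_for_status()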

Finally for searching, use the eDisMax query parser, and include all the language fields in the qf parameter (and pf, if using). E.g., in solrconfig.xml:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="qf">content_en content_fr content_jp</str>
    <str name="pf">content_en content_fr content_jp</str>
    ...

When a search is executed with this handler, subqueries will be generated for each language with the appropriate term processing, and searched against each language text field. This approach should give the best precision and recall in a multi-language application.


Principles of Solr application design – part 1 of 2

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! So without further ado here’s the first part:

1. Use the latest release of Solr

Unless there are compelling reasons not to, such as reliance on a discontinued feature (which is rare), it is best to use the latest release of Solr, downloaded from http://lucene.apache.org/solr/ . Every minor release in the 4.x series has brought both functional and performance enhancements, and revision releases have fixed known bugs. Since the API (as a rule) remains backwards compatible, the potential gains in performance and utility should outweigh the minor inconvenience of the upgrade.

2. Use SolrCloud for scaling and robustness

Before the Solr 4 release, support for sharding (distributing a single search over many Solr instances) and replication (for robustness and scaling search load) involved a significant amount of manual configuration and development. The introduction of SolrCloud means that sharding and replication are now built into the core product, and can be used with simple configuration and no extra coding.

For trivial applications, SolrCloud may not be required, but it is the simplest way to build in robustness and scalability. There’s more about SolrCloud here.

3. Don’t expose the Solr API

Although Solr is not inherently insecure, it is not designed to be exposed to end-users (and emphatically not to the internet at large). Anyone with access to the root Solr endpoint would be able to delete indexes and modify or insert items at will. Restricting external access to the search handlers only (e.g. /solr/select) avoids this possibility, but is still a bad idea, since it may allow users to construct arbitrary queries which could degrade performance or provide access to unauthorised data. Furthermore, there remains the slim possibility of security holes in the Solr API.

For these reasons, any external access to search should be through a proxy interface which is restricted to the functionality required by the application. Access to the Solr API should be restricted by network design and/or firewalls. This applies equally to AJAX UIs, which should talk to Solr via an intermediary web application rather than directly.

The intermediary code should perform at least some basic validation of parameters before sending to Solr, for example checking their type and ensuring that query strings are under a certain length (depending on the search interface). This allows attempts at compromising the system to be detected at an early stage and blocked.
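A minimal sketch of such an intermediary is shown below, here using Flask; the URLs, parameter limits and handler path are illustrative only, and a real proxy would also handle authentication and map Solr’s response into whatever format the front end expects:

from flask import Flask, abort, jsonify, request
import requests

app = Flask(__name__)

# Solr lives on an internal network and is never reachable by end users directly.
SOLR_SELECT = "http://solr-internal:8983/solr/collection1/select"

@app.route("/search")
def search():
    q = request.args.get("q", "")
    if not q or len(q) > 200:                            # reject empty or suspiciously long queries
        abort(400)
    rows = min(request.args.get("rows", default=10, type=int), 50)   # cap the page size
    params = {"q": q, "rows": rows, "wt": "json"}        # forward only whitelisted parameters
    resp = requests.get(SOLR_SELECT, params=params, timeout=5)
    resp.raise_for_status()
    return jsonify(resp.json())

if __name__ == "__main__":
    app.run(port=8000)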

4. Don’t use third-party Solr client libraries

The problem with third-party client libraries is that they create a tight coupling between the application and Solr. The Solr XML and JSON APIs are simple, and a wide range of client libraries for these formats are readily available for most programming languages. Third-party libraries are an unnecessary additional dependency and a potential source of bugs and unexpected behaviour. Another risk is that development may be discontinued for various reasons, meaning that future Solr features are not easily accessible.

The one exception to this rule is the SolrJ Java client library, since it is part of the general Solr release and is therefore fully compliant with and tested against the corresponding version of Solr.

5. Specify interfaces

All interfaces between components in the application must be agreed between sys ops and developers before development is started. Interfaces should be treated as contracts which software components adhere to. Early documentation of interfaces will reduce the risk of unexpected dependencies leading to problems in deployment.

As far as possible, interfaces should be RESTful web APIs and use standard formats such as JSON and XML. This creates loose coupling between components and also makes it easy to test functionality from the command line or a browser.

6. Put apps live early, on isolated systems

Development should be iterative, with short development cycles (no more than a few weeks). Code should be tested and deployed at the end of each cycle. By using isolated systems, fake data and/or limiting access to authorised testers, functionality and performance may be tested as soon as possible on a ‘live’ system, avoiding the risk of unexpected problems if deployment is postponed until the end of the development cycle.

7. Do realistic performance tests early and often

Except for very small indexes, search performance is often unpredictable, particularly under load. To ensure that performance meets requirements, testing a full index under load with realistic queries should be scheduled as early as possible in development. If you don’t have the data available to create a full index, simulate it (e.g. using freely available text such as Wikipedia).

As new functions (e.g. facets) are added, performance characteristics may change significantly, so it is important that performance tests are part of every development cycle. JMeter is a popular tool for load testing; alternatively, test scripts can easily be written in a language like Python.
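As a starting point, a very simple load test can be scripted in a few lines of Python; the URL, sample queries and concurrency level below are placeholders to be replaced with values realistic for your application:

import concurrent.futures
import random
import time

import requests

SEARCH_URL = "http://localhost:8983/solr/collection1/select"   # or, better, your proxy interface
QUERIES = ["climate change", "annual report 2013", "share price", "merger"]

def one_search(q):
    start = time.time()
    requests.get(SEARCH_URL, params={"q": q, "wt": "json"}, timeout=30).raise_for_status()
    return time.time() - start

# Fire 1,000 queries from 20 concurrent threads and report latency percentiles.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    timings = sorted(pool.map(one_search, (random.choice(QUERIES) for _ in range(1000))))

print("median %.3fs, 95th percentile %.3fs"
      % (timings[len(timings) // 2], timings[int(len(timings) * 0.95)]))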

More to come next week!
