Posts Tagged ‘SOLR’

Open source search events roundup for late 2015

Although it’s still high summer here in the UK (which means it’s probably raining) we’re already looking forward to the autumn and the events across the world we’re attending. In early September we’re running another free to attend London Lucene/Solr Usergroup Meetup, sponsored this time by Blackrock who are talking about using Solr for websites. At the end of September there is another Elasticsearch London Meetup which we will also attend (and may be speaking at this time).

October brings the biggest event in the Lucene/Solr calendar, Lucene Revolution in Austin, Texas, a 4-day event with training and a conference. We’re happy to announce that Alan Woodward and Matt Pearce from Flax will be presenting “Searching the Stuff of Life: BioSolr” about our work with the European Bioinformatics Institute where we’ve been developing Solr features for use by bioinformaticians (and any others who find them useful of course!), for example ontology indexing and external JOINs.

A week later we’ll be at Enterprise Search Europe, where I’ll be delivering the keynote on The Future of Search (you can see an earlier version of this talk from the IKO Singapore conference last month). We’re also running a Meetup on the evening of the 20th open to both conference attendees and others – an informal chance to chat with other search folks. During the conference itself I’m particularly looking forward to hearing from Ian Williams of NHS Wales on Powering the Single Patient Record in NHS Wales with Apache Solr – this is a very large scale and exciting project using Solr for healthcare data.

Looking further ahead, in November we have plans to attend (and possibly speak at) Search Solutions 2015, a great one-day event in London which I highly recommend, and we are planning another event in Singapore together with a partner. As ever, do let us know if you would like to meet up at an event and talk open source search!

BioSolr at BOSC 2015 – open source search for bioinformatics

Matt Pearce writes:

I spent most of last Friday at the Bioinformatics Open Source Conference (BOSC) Special Interest Group meeting in Dublin, as part of this year’s ISMB/ECCB conference. Tony Burdett from EMBL-EBI was giving a quick talk about the BioSolr project, and I went along to speak to people at the poster session afterwards about what we are doing, and how other teams could get involved.

Unfortunately, I missed the first half of Holly Bik’s keynote (registration seemed to take forever, hindered by dubious wifi and a printer that refused to cooperate), which used the vintage Oregon Trail game as a great analogy for biologists getting into bioinformatics – there are many, frequently intimidating, options when choosing how to analyse data, and picking the right one can be scary (this is something that definitely applies to the areas we work in as well).

There was a new approach to the traditional Q&A session afterwards as well, with questions being submitted on cards around the room, and via a Twitter hashtag. This worked pretty well, although Twitter latency did slow things down a couple of times, and there were a few shouted-out questions from the floor, but certainly better than having volunteers with microphones trying to reach the questioner across rows of people.

The morning session was on Data Science, and while a number of the talks went over my head somewhat, it was interesting to see how tools like Hadoop are being used in Bioinformatics. It was good to see the spirit of collaboration in action too, with Sebastian Schoenherr’s talk about CloudGene, a graphical front end for Hadoop that came about following an earlier BOSC. Tony’s talk about BioSolr went down well – the show of hands for people in the room using Lucene, Solr and/or Elasticsearch indicated that around 75% of those present were using search engines in some form. This backs up our earlier experience at the EBI, where the first BioSolr workshop was attended by teams from all over the campus, using Lucene or Solr in various versions to store and search their data.

Crossing over with lunch was the poster session, where Tony and I spoke to people about BioSolr. The Jalview team seemed especially interested in potential cross-over with their project, and there was plenty of interest generally in how the various extensions we have worked on (X-Join, hierarchical faceting) could be fitted into other projects.

The afternoon session was on the subject of Standards and Interoperability, starting with a great talk from Michael Crusoe about the Common Workflow Language, which started life at the BOSC 2014 codefest. There were several talks about Galaxy, a cloud-based platform for sharing data analyses, linking many other tools to allow workflows to be reproduced. Bruno Vieira’s talk about BioNode was also very interesting, and I made notes to check out oSwitch when time is available.

I had to leave before the afternoon’s panel took place, but all in all it was a very interesting day learning how open source software is being used outside of the areas I usually work in.

The four types of open source search project

As I’m currently writing content for our new Flax website (which is taking far longer than anticipated for various reasons I won’t bore you with) I’ve been thinking about the sort of projects we encounter at Flax. You might find this useful if you’re planning or starting a search project with Solr or Elasticsearch. Note that not everything we do fits cleanly into these four categories!

The search idea

So you’ve got this idea and you’re convinced that you need search as part of the puzzle, but you’re not sure where it fits, whether it will be performant or how to gather and transform your data so it’s ready for searching. Perhaps you’re from a startup, or maybe part of a skunkworks project in a larger organisation. What you need is someone who really understands search software and what can be done with it to sit with you for a day or two, validate your technical choices, help you understand how to shape your data, even play with some basic indexing.

The proof of concept

You’re a little further along – you know what technology you’ll be using and you have some data all ready for indexing. However, before your funders or boss will release more budget you need to build something they can see (and search) – you’ll need an indexer and a basic search application. You could do it yourself but time is limited and you’ve not built a search application before. You’re expecting to spend a week or two developing something to show others, that lets them search real data and see real results. You might also want to experiment with scale – see what happens to performance when you add a few million items to the index, even if the schema isn’t quite right yet.

The big one

You’re building the big one – indexing complex data or many millions of items, and/or for a huge user base. You need to be very sure your indexing pipeline is fast, scales well, copes with updates and can transform data from many sources. You need to develop the very best search schema. Your search architecture must be resilient, cope with heavy load, failover cleanly and give the correct results. You’re assembling a team to build it but you need specialist help from people who have built this kind of system at scale before.

The migration

Finally you’ve secured budget to move away from the slow and inaccurate search engine that everyone hates! Search really does suck, but you now have a chance to make it better. However, although you know how to keep the old engine running you don’t have much experience of open source search. Even though the old engine isn’t great, you’re doing a lot of business with it and you want to be confident that relevance is as good (and hopefully better) with the new engine – maybe you want to develop a testing framework?

We’re also increasingly delivering training (both for business users who want to know the capabilities of open source search and for technical users who want to improve their knowledge – we can tailor this to your requirements) and ongoing support – but everything starts with a search project of some kind!

Innovations in Knowledge Organisation, Singapore: a review

I’m just back from Singapore: my first visit to this amazing, dynamic and ever-changing city-state, at the kind invitation of Patrick Lambe, to speak at the first Innovations in Knowledge Organisation conference. I think this was probably one of the best organised and most interesting events I’ve attended in the last few years.

The event started with an enthusiastic keynote from Patrick, introducing the topics we’d discuss over the next two days: knowledge management, taxonomies, linked data and search, a wide range of interlinked and interdependent themes. Next was a series of quick-fire PechaKucha sessions – 20 slides, 20 seconds each – a great way to introduce the audience to the topics under discussion, although slightly terrifying to deliver! I spoke on open source search, covering Elasticsearch & Solr and how to start a project using them, and somehow managed to draw breath occasionally. I think my fellow presenters also found it somewhat challenging although nobody lost the pace completely! Next was a quick, interactive panel discussion (roving mics rather than a row of seats) that set the scene for how the event would work – reactive, informal and exciting, rather than the traditional series of audience-facing Powerpoint presentations which don’t necessarily combine well with jetlag.

After lunch, showcasing Singapore’s multicultural heritage (I don’t think I’ve ever had pasta with Chinese peppered beef before, but I hope to again) we moved on to the first set of case studies. Each presenter had 6 minutes to sell their case study (my own was about how we helped Reed Specialist Recruitment build an open source search platform) and then attendees could choose which tables to join to discuss the cases further, for three 20-minute sessions. I had some great discussions including hearing about how a local government employment agency has used Solr. We then moved on to a ‘knowledge cafe’, with tables again divided up by topics chosen by the audience – so this really was a conference about what attendees wanted to discuss, not just what the presenters thought was important.

I was scheduled to deliver the keynote the next day, having been asked to speak on ‘The Future of Search’ – I chose to introduce some topics around Big Data and Streaming Analytics, and how search software might be used to analyze the huge volumes of data we might expect from the Internet of Things. I had some great feedback from the audience (although I’m pretty sure I inspired and confused them in equal measure) – perhaps Singapore was the right place to deliver this talk, as the government are planning to make it the world’s first ‘smart nation’ – handling data will be absolutely key to making this possible.

More case study pitches followed, and since I wasn’t delivering one myself this time I had a chance to listen to some of the studies. I particularly enjoyed hearing from Kia Siang Hock about the National Library Board Singapore’s OneSearch service, which allowed a federated search across tens of millions of items from many different repositories (e.g. books, newspaper articles, audio transcripts). The technologies used included Veridian, Solr, Vocapia for speech transcription and Mahout for building a recommendation system. In particular, Solr was credited for saving ‘millions of Singapore dollars’ in license fees compared to the previous closed source search system it replaced. Also of interest was Straits Knowledge’s system for capturing the knowledge assets of an organisation with a system built on a graph database, and Haliza Jailani on using named entity recognition and Linked Data (again for the National Library Board Singapore).

We then moved into the final sessions of the day, ‘knowledge clinics’ – like the ‘knowledge cafes’ these were table-based, informal and free-form discussions around topics chosen by attendees. Matt Moore then gave the last session of the day with an amusing take on Building Competencies, dividing KM professionals into individuals, tribes and organisations. Patrick and Maish Nichani then closed the event with a brief summary.

Singapore is a long way to go for an event, but I’m very glad I did. The truly international mix of attendees, the range of subjects and the dynamic and focused way the conference was organised made for a very interesting and engaging two days: I also made some great contacts and had a chance to see some of this beautiful city. Congratulations to Patrick, Maish and Dave Clarke on a very successful inaugural event and I’m looking forward to hearing about the next one! Slides and videos are already appearing on the IKO blog.

London Lucene/Solr Usergroup – Search Relevancy & Hacking Lucene with Doug Turnbull

Last week Doug Turnbull of US-based Open Source Connections visited the UK and spoke at our Meetup. His first talk was on Search Relevancy, an area that we often deal with at Flax: how to tune a search engine to give results that our clients deem relevant, without affecting the results for other queries. Using a client project as an example, Doug talked about how he created a tool to record relevance judgements for a set of queries (or a ‘case’). The underlying Solr search engine could then be adjusted and the tool re-runs the queries to show any change in the position of the scored results. Slides and video of the talk are available – thanks to our hosts SkillsMatter for these.

The tool, Quepid, is a great way to allow non-developers to score search results – in most cases we have seen, if this kind of testing is done at all it is recorded using spreadsheets. The tests then need to be re-run manually and scores updated, which can result in the tuning process taking far too long. This whole area is in need of some rigour and best practice, and to that end Doug is writing a book on Relevant Search which we’re very much looking forward to.
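To make the idea of re-running judged queries a little more concrete, here is a minimal sketch in Python. It is not how Quepid itself works or scores results: the judgements, document ids and the choice of discounted cumulative gain as the metric are all assumptions made for illustration, and the two result lists stand in for whatever your search engine returns before and after a tuning change.

```python
from math import log2

# Hypothetical relevance judgements for one 'case': query -> {doc_id: grade 0-3}.
JUDGEMENTS = {
    "java developer": {"job-1": 3, "job-2": 2, "job-7": 0},
    "solr consultant": {"job-4": 3, "job-5": 1},
}

def dcg_at_k(ranked_ids, grades, k=10):
    """Discounted cumulative gain: higher grades near the top score more."""
    return sum(
        grades.get(doc_id, 0) / log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )

def score_case(results_by_query, judgements, k=10):
    """Score every judged query against one set of search results."""
    return {
        query: dcg_at_k(results_by_query.get(query, []), grades, k)
        for query, grades in judgements.items()
    }

def compare_runs(before, after, judgements, k=10):
    """Return the per-query change in score between two engine configurations."""
    old = score_case(before, judgements, k)
    new = score_case(after, judgements, k)
    return {query: new[query] - old[query] for query in judgements}

# Stand-ins for the ranked ids returned before and after a tuning change.
before = {
    "java developer": ["job-7", "job-2", "job-1"],
    "solr consultant": ["job-4", "job-5"],
}
after = {
    "java developer": ["job-1", "job-2", "job-7"],
    "solr consultant": ["job-5", "job-4"],
}

changes = compare_runs(before, after, JUDGEMENTS)
for query, delta in changes.items():
    print(f"{query}: {delta:+.3f}")
```

In this made-up example the change helps one query and hurts the other, which is exactly the kind of side effect that is hard to spot when judgements live in a spreadsheet and are re-checked by hand.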

Doug’s second talk was on Hacking Lucene for custom search results, during which he dissected how Lucene queries actually work and how custom scoring algorithms can be used to change search ranking. Although highly technical in parts – and as Doug said, one of the hardest ways to write Lucene code to influence ranking and thus relevance – it was a great window on Lucene’s low level behaviour. Again, slides and video are available.

Thanks to all who came and especially Doug for coming so far to present his talks!

Posted in Technical, events

June 11th, 2015

Going international – open source search in London, Berlin & Singapore

We’re travelling a bit over the next few weeks to visit and speak at various events. This weekend Alan Woodward is at Berlin Buzzwords, a hacker-focused conference with a programme full of search talks. He’s not speaking this year, but if you want to talk about Lucene, Solr or our own Luwak stored search library and the crazy things you can do with it, do buy him a beer!

Next week we’re hosting another London Lucene/Solr User Group Meetup with Doug Turnbull of Open Source Connections. Doug is the author of a forthcoming book on Relevant Search and the creator of Quepid, a tool for gathering relevance judgements for Solr-based search systems and then seeing how these scores change as you tune the Solr installation. Tuning relevance is a very common (and often difficult) task during search projects and can make a significant difference to the user experience (and in particular, for e-commerce can hugely affect your bottom line) – so we’re very much looking forward to Doug’s talk.

The week after I’m in Singapore visiting the Innovations in Knowledge Organisation conference – a new event focusing on knowledge management and search. I’ve been asked to talk about open source search and to keynote the second day of the event and speak on ‘The Future of Search’. Do let me know if you’re attending and would like to meet up.

Posted in events

May 29th, 2015

Lucene/Solr London Meetup – BioSolr and Query Deep Dive

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Databank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets and a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.
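As a rough illustration of what joining search results with an external system means, here is a toy sketch in Python. It is not the XJoin code or its API (that lives in the BioSolr repository and runs inside Solr); the document ids, titles and the `similarity` field are hypothetical, and the join happens in memory purely to show the shape of the idea.

```python
def external_join(search_hits, external_results, join_field="id"):
    """Keep search hits that the external system also returned, and attach
    the external system's data to each one."""
    joined = []
    for hit in search_hits:
        extra = external_results.get(hit[join_field])
        if extra is not None:
            joined.append({**hit, "external": extra})
    # Order by the external similarity score, best first.
    joined.sort(key=lambda doc: doc["external"]["similarity"], reverse=True)
    return joined

# Hypothetical hits from a text search.
search_hits = [
    {"id": "1abc", "title": "Kinase structure"},
    {"id": "2xyz", "title": "Kinase inhibitor complex"},
    {"id": "3def", "title": "Unrelated kinase entry"},
]

# Hypothetical output of an external similarity service, keyed by the same id.
external_results = {
    "2xyz": {"similarity": 0.91},
    "1abc": {"similarity": 0.47},
}

joined = external_join(search_hits, external_results)
for doc in joined:
    print(doc["id"], doc["title"], doc["external"]["similarity"])
```

The appeal of doing this inside the search engine rather than in application code, as sketched here, is that the joined values can then take part in filtering, sorting and faceting like any other field.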

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterize queries so that a Solr endpoint can be customized for a particular purpose and how to combine different query parsers. His slides are available here.
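For readers who have not met these techniques, the sketch below builds a set of request parameters in Python using two standard pieces of Solr syntax: local params (`{!parser key=value}`) and parameter dereferencing (`$name`). It is only an illustration and not taken from the talk; the field names, boosts and the `status` filter are assumptions.

```python
from urllib.parse import urlencode, parse_qs

def build_search_params(user_query, qf="title^3 body"):
    """Build request parameters for a Solr /select handler.

    The main query uses local-params syntax ({!parser key=value}) and
    dereferences other request parameters with $name, so the query
    template stays fixed while the caller only supplies values.
    """
    return {
        # Template: parse $qq with edismax over the fields named in $search_fields.
        "q": "{!edismax qf=$search_fields v=$qq}",
        "qq": user_query,            # raw user input, kept out of the template
        "search_fields": qf,         # hypothetical field names and boosts
        # A filter handled by a different parser: the 'term' parser matches
        # one exact indexed term with no query-syntax parsing.
        "fq": "{!term f=status}published",
        "wt": "json",
    }

params = build_search_params("open source search")
query_string = urlencode(params)
print(query_string)
```

Because the template in `q` never changes, it could equally be set as a default on a request handler in the Solr configuration, leaving clients to send only `qq`.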

Thanks to all our speakers, to Barclays for providing the venue and for some very tasty food and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

IntraTeam 2015 – a brief visit

Last week I dropped in on the IntraTeam 2015 conference in Copenhagen, an event focused on intranets with some content on enterprise search. After a rather pleasant evening of Thai food and networking I attended the last day of the event. The keynote speaker was Dave Snowden, who has an amusing and rather curmudgeonly style of presentation, making sure to note the previous presenters he’d disagreed with for their over-reliance on simplistic concepts of knowledge and how the brain works. His talk was however very interesting and introduced the Cynefin framework (a Welsh word which apparently refers to homing sheep!). He also discussed how the rush to digitisation has had a cost in terms of human cognition, how the concept of an intranet will soon disappear (a brave assertion at an intranet conference) and how future systems should perhaps use storytelling metaphors – with some great examples of how collecting these micro-narratives from employees and others can produce extremely rapid feedback on the health of a business.

Andreas Hallgren of Chalmers University showed the evolution of their site-wide search facility, now based on Apache Solr. Unsurprisingly one of the main problems was determining who ‘owns’ search in their organisation: at least now they have a staff member who dedicates 25% of their time to improving search. He had some interesting points about the seasonality of academic searches and how analytics can be used to ‘measure more, guess less’. I was up next talking about Search Turned Upside Down, using a similar set of slides to this one: thanks to all who came and asked some great questions.

Next was Helen Lippell who I have heard speak before on how to get Enterprise Search right – Helen had some great anecdotes and guidance for an attentive audience. Ed Dale followed with five tips for great search: index the right content, optimise this content, measure search, make a great UI and listen to your users – I can only agree! He also characterised the different kinds of content including the worrying ‘content we think we have but we don’t’. The last presentation I attended was by Anders Quitzau of IBM on their fascinating Watson technology: sadly this was a rather marketing-heavy set of slides, with plenty of newly minted buzzwords such as Cognitive Computing and very little useful detail.

Thanks to Kurt Kragh Sorenson and Kristian Norling for inviting me to speak and attend the conference. Next time I hope to see a little more of the event!

Lucene/Solr London User Group – Alfresco & Datastax

We had another London user group Meetup last week, hosted by Reed.co.uk who also provided some tasty pizza – eaten under the ‘Love Mondays’ sign from their adverts, which now lives in their boardroom! A few new faces this time and a couple of great talks from two companies who have incorporated Solr into their platforms.

First up was Andy Hind, a founding developer of document management company Alfresco, who told us all about how they originally based their search capability on Lucene 2.4, then moved to Solr 4.4 and most recently version 4.9.1. Using Solr they have implemented often complex security requirements (originally using a PostFilter as Erik Hatcher describes and more recently in the query itself), structured queries (using Phrase and SpanQueries) and their own domain specific query language (DSL) – they can support SQL-like, Lucene and Google-like queries by passing them through parsers based on ANTLR to be served either by the search engine or whatever relational database Alfresco is using. The move to a recent version of Solr has allowed the most recent release of Alfresco to support various modern search features (facets, spelling suggestions etc.) but Andy did mention that so far they are not using SolrCloud for scaling, preferring to manage this themselves.
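To show roughly what putting security ‘in the query itself’ can look like, here is a small Python sketch that turns a user’s authorities into a Solr filter query. This is an assumption-laden illustration and not Alfresco’s implementation: the `acl_readers` field, the authority names and the simple OR-of-phrases form are all hypothetical.

```python
def quote_term(value):
    """Quote a value as a phrase so query-syntax characters are taken literally."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def acl_filter(authorities, field="acl_readers"):
    """Build a filter query matching documents readable by any given authority.

    `field` is a hypothetical multi-valued string field holding the users
    and groups allowed to read each document.
    """
    if not authorities:
        raise ValueError("cannot build an access filter without authorities")
    clauses = " OR ".join(quote_term(a) for a in sorted(set(authorities)))
    return f"{field}:({clauses})"

def secure_params(user_query, authorities):
    """Combine the user's query with the access filter as a separate fq."""
    return {"q": user_query, "fq": acl_filter(authorities)}

params = secure_params("quarterly report", ["GROUP_hr", "alice"])
print(params["fq"])
```

A post-filter takes the other route: rather than adding clauses up front, it checks each candidate document after the main query has matched, which suits access rules too complex to express as index terms.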

Next up was Sergio Bossa of Datastax, talking about how their Datastax Enterprise (DSE) product incorporates Solr searching within an Apache Cassandra cluster. Sergio has previously spoken at our Cambridge search meetup on a very similar subject, so I won’t repeat myself here, but the key point is that Solr lives directly on top of the Cassandra cluster, so you don’t have to worry about it at all – search features are directly available from the Cassandra APIs. Like Alfresco, this is an alternative to SolrCloud (assuming you also need a NoSQL database of course!).

Thanks again to Alex Rice for hosting the Meetup, to both our speakers and to all who came – we’ll return soon! In the meantime you may want to check out a few events coming later this year: Berlin Buzzwords, ApacheCon Europe and Lucene/Solr Revolution.

Posted in Technical, events

February 16th, 2015

Out and about in January and February

We’re speaking at a couple of events soon: if you’re in London and interested in Apache Lucene/Solr we’re also planning another London User Group Meetup soon.

Firstly my colleague Alan Woodward is speaking with Martin Kleppmann at FOSDEM in Brussels (31st January-1st February) on Searching over streams with Luwak and Apache Samza – about some fascinating work they’ve been doing to combine the powerful ‘reverse search’ facilities of our Luwak library with Apache Samza’s distributed, stream-based processing. We’re hoping this means we can scale Luwak beyond its current limits (although those limits are pretty accommodating, as we know of systems where a million or so stored searches are applied to a million incoming messages every day). If you’re interested in open source search the Devroom they’re speaking in has lots of other great talks planned.
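If ‘reverse search’ is a new idea to you, the toy Python sketch below shows the principle: the queries are stored and each incoming document is tested against them, rather than the other way round. It is not Luwak’s API (Luwak is a Java library built on Lucene); the class, its method names and the all-terms-must-match query model are simplifying assumptions.

```python
from collections import defaultdict

class StoredSearchMonitor:
    """Toy 'reverse search': queries are stored, documents stream past them.

    Each stored query here is just a set of terms that must all appear in
    a document. Real libraries support far richer query types.
    """

    def __init__(self):
        self.queries = {}                    # query id -> required terms
        self.term_index = defaultdict(set)   # term -> ids of queries using it

    def register(self, query_id, text):
        terms = set(text.lower().split())
        self.queries[query_id] = terms
        for term in terms:
            self.term_index[term].add(query_id)

    def match(self, document_text):
        doc_terms = set(document_text.lower().split())
        # Preselection: only queries sharing at least one term with the
        # document can possibly match, so skip all the others.
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # Full check on the (usually much smaller) candidate set.
        return sorted(
            query_id for query_id in candidates
            if self.queries[query_id] <= doc_terms
        )

monitor = StoredSearchMonitor()
monitor.register("q-solr", "solr relevance")
monitor.register("q-samza", "samza streams")
monitor.register("q-lucene", "lucene")

matches = monitor.match("Tuning relevance in Solr and Lucene")
print(matches)
```

The preselection step is what matters at scale: with a very large number of stored searches, running every one against every message is wasteful, so cheaply discarding the ones that cannot match is where much of the saving comes from.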

Next I’m talking about the wider applications of this kind of reverse search in the area of media monitoring, and how open source software in general can help you turn your organisation’s infrastructure upside down, at the IntraTeam conference in Copenhagen from February 24th-26th. Scroll down to find my talk at 11.35 am on Thursday 26th.

If you’d like to meet us at either of these events do get in touch.