Posts Tagged ‘events’

Open source search events roundup for late 2015

Although it’s still high summer here in the UK (which means it’s probably raining) we’re already looking forward to the autumn and the events across the world we’re attending. In early September we’re running another free to attend London Lucene/Solr Usergroup Meetup, sponsored this time by Blackrock who are talking about using Solr for websites. At the end of September there is another Elasticsearch London Meetup which we will also attend (and may be speaking at this time).

October brings the biggest event in the Lucene/Solr calendar, Lucene Revolution in Austin, Texas, a 4-day event with training and a conference. We’re happy to announce that Alan Woodward and Matt Pearce from Flax will be presenting “Searching the Stuff of Life: BioSolr” about our work with the European Bioinformatics Institute where we’ve been developing Solr features for use by bioinformaticians (and any others who find them useful of course!), for example ontology indexing and external JOINs.

A week later we’ll be at Enterprise Search Europe, where I’ll be delivering the keynote on The Future of Search (you can see an earlier version of this talk from the IKO Singapore conference last month). We’re also running a Meetup on the evening of the 20th open to both conference attendees and others – an informal chance to chat with other search folks. During the conference itself I’m particularly looking forward to hearing from Ian Williams of NHS Wales on Powering the Single Patient Record in NHS Wales with Apache Solr – this is a very large scale and exciting project using Solr for healthcare data.

Looking further ahead, in November we have plans to attend (and possibly speak) at Search Solutions 2015, a great one-day event in London which I highly recommend, and we are planning another event in Singapore together with a partner. As ever, do let us know if you would like to meet up at an event and talk open source search!

BioSolr at BOSC 2015 – open source search for bioinformatics

Matt Pearce writes:

I spent most of last Friday at the Bioinformatics Open Source Conference (BOSC) Special Interest Group meeting in Dublin, as part of this year’s ISMB/ECCB conference. Tony Burdett from EMBL-EBI was giving a quick talk about the BioSolr project, and I went along to speak to people at the poster session afterwards about what we are doing, and how other teams could get involved.

Unfortunately, I missed the first half of Holly Bik’s keynote (registration seemed to take forever, hindered by dubious wifi and a printer that refused to cooperate), which used the vintage Oregon Trail game as an great analogy for biologists getting into bioinformatics – there are many, frequently intimidating, options when choosing how to analyse data, and picking the right one can be scary (this is something that definitely applies to the areas we work in as well).

There was a new approach to the traditional Q&A session afterwards as well, with questions being submitted on cards around the room, and via a Twitter hashtag. This worked pretty well, although Twitter latency did slow things down a couple of times, and there were a few shouted-out questions from the floor, but certainly better than having volunteers with microphones trying to reach the questioner across rows of people.

The morning session was on Data Science, and while a number of the talks went over my head somewhat, it was interesting to see how tools like Hadoop are being used in Bioinformatics. It was good to see the spirit of collaboration in action too, with Sebastian Schoenherr’s talk about CloudGene, a project that came about following an earlier BOSC that implements a graphical front end for Hadoop. Tony’s talk about BioSolr went down well – the show of hands for people in the room using Lucene, Solr and/or Elasticsearch indicated around 75% there were using search engines in some form. This backs up our earlier experience at the EBI, where the first BioSolr workshop was attended by teams from all over the campus, using Lucene or Solr in various versions to store and search their data.

Crossing over with lunch was the poster session, where Tony and I spoke to people about BioSolr. The Jalview team seemed especially interested in potential cross-over with their project, and there was plenty of interest generally in how the various extensions we have worked on (X-Join, hierarchical faceting) could be fitted into other projects.

The afternoon session was on the subject of Standards and Interoperability, starting with a great talk from Michael Crusoe about the Common Workflow Language, which started life at the BOSC 2014 codefest. There were several talks about Galaxy, a cloud-based platform for sharing data analyses, linking many other tools to allow workflows to be reproduced. Bruno Vieira’s talk about BioNode was also very interesting, and I made notes to check out oSwitch when time is available.

I had to leave before the afternoon’s panel took place, but all in all it was a very interesting day learning how open source software is being used outside of the areas I usually work in.

Innovations in Knowledge Organisation, Singapore: a review

I’m just back from Singapore: my first visit to this amazing, dynamic and everchanging city-state, at the kind invitation of Patrick Lambe, to speak at the first Innovations in Knowledge Organisation conference. I think this was probably one of the best organised and most interesting events I’ve attended in the last few years.

The event started with an enthusiastic keynote from Patrick, introducing the topics we’d discuss over the next two days: knowledge management, taxonomies, linked data and search, a wide range of interlinked and interdependent themes. Next was a series of quick-fire PechaKucha sessions – 20 slides, 20 seconds each – a great way to introduce the audience to the topics under discussion, although slightly terrifying to deliver! I spoke on open source search, covering Elasticsearch & Solr and how to start a project using them, and somehow managed to draw breath occasionally. I think my fellow presenters also found it somewhat challenging although nobody lost the pace completely! Next was a quick, interactive panel discussion (roving mics rather than a row of seats) that set the scene for how the event would work – reactive, informal and exciting, rather than the traditional series of audience-facing Powerpoint presentations which don’t necessarily combine well with jetlag.

After lunch, showcasing Singapore’s multicultural heritage (I don’t think I’ve ever had pasta with Chinese peppered beef before, but I hope to again) we moved on to the first set of case studies. Each presenter had 6 minutes to sell their case study (my own was about how we helped Reed Specialist Recruitment build an open source search platform) and then attendees could choose which tables to join to discuss the cases further, for three 20-minute sessions. I had some great discussions including hearing about how a local government employment agency has used Solr. We then moved on to a ‘knowledge cafe’, with tables again divided up by topics chosen by the audience – so this really was a conference about what attendees wanted to discuss, not just what the presenters thought was important.

I was scheduled to deliver the keynote the next day, having been asked to speak on ‘The Future of Search’ – I chose to introduce some topics around Big Data and Streaming Analytics, and how search software might be used to analyze the huge volumes of data we might expect from the Internet of Things. I had some great feedback from the audience (although I’m pretty sure I inspired and confused them in equal measure) – perhaps Singapore was the right place to deliver this talk, as the government are planning to make it the world’s first ‘smart nation‘ – handling data will absolutely key to making this possible.

More case study pitches followed, and since I wasn’t delivering one myself this time I had a chance to listen to some of the studies. I particularly enjoyed hearing from Kia Siang Hock about the National Library Board Singapore’s OneSearch service, which allowed a federated search across tens of millions of items from many different repositories (e.g. books, newspaper articles, audio transcripts). The technologies used included Veridian, Solr, Vocapia for speech transcription and Mahout for building a recommendation system. In particular, Solr was credited for saving ‘millions of Singapore dollars’ in license fees compared to the previous closed source search system it replaced. Also of interest was Straits Knowledge’s system for capturing the knowledge assets of an organisation with a system built on a graph database, and Haliza Jailani on using named entity recognition and Linked Data (again for the National Library Board Singapore).

We then moved into the final sessions of the day, ‘knowledge clinics’ – like the ‘knowledge cafes’ these were table-based, informal and free-form discussions around topics chosen by attendees. Matt Moore then gave the last session of the day with an amusing take on Building Competencies, dividing KM professionals into individuals, tribes and organisations. Patrick and Maish Nichani then closed the event with a brief summary.

Singapore is a long way to go for an event, but I’m very glad I did. The truly international mix of attendees, the range of subjects and the dynamic and focused way the conference was organised made for a very interesting and engaging two days: I also made some great contacts and had a chance to see some of this beautiful city. Congratulations to Patrick, Maish and Dave Clarke on a very successful inaugural event and I’m looking forward to hearing about the next one! Slides and videos are already appearing on the IKO blog.

London Lucene/Solr Usergroup – Search Relevancy & Hacking Lucene with Doug Turnbull

Last week Doug Turnbull of US-based Open Source Connections visited the UK and spoke at our Meetup. His first talk was on Search Relevancy, an area that we often deal with at Flax: how to tune a search engine to give results that our clients deem relevant, without affecting the results for other queries. Using a client project as an example, Doug talked about how he created a tool to record relevance judgements for a set of queries (or a ‘case’). The underlying Solr search engine could then be adjusted and the tool re-runs the queries to show any change in the position of the scored results. Slides and video of the talk are available – thanks to our hosts SkillsMatter for these.

The tool, Quepid, is a great way to allow non-developers to score search results – in most cases we have seen, if this kind of testing is done at all it is recorded using spreadsheets. The tests then need to be re-run manually and scores updated, which can result in the tuning process taking far too long. This whole area is in need of some rigor and best practise, and to that end Doug is writing a book on Relevant Search which we’re very much looking forward to.

Doug’s second talk was on Hacking Lucene for custom search results, during which he dissected how Lucene queries actually work and how custom scoring algorithms can be used to change search ranking. Although highly technical in parts – and as Doug said, one of the hardest ways to write Lucene code to influence ranking and thus relevance – it was a great window on Lucene’s low level behaviour. Again, slides and video are available.

Thanks to all who came and especially Doug for coming so far to present his talks!

Tags: , , , ,

Posted in Technical, events

June 11th, 2015

No Comments »

Going international – open source search in London, Berlin & Singapore

We’re travelling a bit over the next few weeks to visit and speak at various events. This weekend Alan Woodward is at Berlin Buzzwords, a hacker-focused conference with a programme full of search talks. He’s not speaking this year, but if you want to talk about Lucene, Solr or our own Luwak stored search library and the crazy things you can do with it, do buy him a beer!

Next week we’re hosting another London Lucene/Solr User Group Meetup with Doug Turnbull of Open Source Connections. Doug is the author of a forthcoming book on Relevant Search and the creator of Quepid, a tool for gathering relevance judgements for Solr-based search systems and then seeing how these scores change as you tune the Solr installation. Tuning relevance is a very common (and often difficult) task during search projects and can make a significant difference to the user experience (and in particular, for e-commerce can hugely affect your bottom line) – so we’re very much looking forward to Doug’s talk.

The week after I’m in Singapore visiting the Innovations in Knowledge Organisation conference – a new event focusing on knowledge management and search. I’ve been asked to talk about open source search and to keynote the second day of the event and speak on ‘The Future of Search’. Do let me know if you’re attending and would like to meet up.

Tags: , , , , ,

Posted in events

May 29th, 2015

No Comments »

Lucene/Solr London Meetup – BioSolr and Query Deep Dive

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Databank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets and a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterize queries so that a Solr endpoint can be customized for a particular purpose and how to combine different query parsers. His slides are available here.

Thanks all our speakers, to Barclays for providing the venue and for some very tasty food and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

Elastic London User Group Meetup – scaling with Kafka and Cassandra

The Elastic London User Group Meetup this week was slightly unusual in that the talks focussed not so much on Elasticsearch but rather on how to scale the systems around it using other technologies. First up was Paul Stack with an amusing description of how he had worked on scaling the logging infrastructure for a major restaurant booking website, to cope with hundreds of millions of messages a day across up to 6 datacentres. Moving from an original architecture based on SQL and ASP.NET, they started by using Redis as a queue and Logstash to feed the logs to Elasticsearch. Further instances of Logstash were added to glue other parts of the system together but Redis proved unable to handle this volume of data reliably and a new architecture was developed based on Apache Kafka, a highly scalable message passing platform originally built at LinkedIn. Kafka proved very good at retaining data even under fault conditions. He continued with a description of how the Kafka architecture was further modified (not entirely successfully) and how monitoring systems based on Nagios and Graphite were developed for both the Kafka and Elasticsearch nodes (with the infamous split brain problem being one condition to be watched for). Although the project had its problems, the system did manage to cope with 840 million messages one Valentine’s day, which is impressive. Paul concluded that although scaling to this level is undeniably hard, Kafka was a good technology choice. Some of his software is available as open source.

Next, Jamie Turner of PostcodeAnywhere described in general terms how they had used Apache Cassandra and Apache Spark to build a scalable architecture for logging interactions with their service, so they could learn about and improve customer experiences. They explored many different options for their database, including MySQL and MongoDB (regarding Mongo, Jamie raised a laugh with ‘bless them, they do try’) before settling on Cassandra which does seem to be a popular choice for a rock-solid distributed database. As PostcodeAnywhere are a Windows house, the availability and performance of .Net compatible clients was key and luckily they have had a good experience with the NEST client for Elasticsearch. Although light on technical detail, Jamie did mention how they use Markov chains to model customer experiences.

After a short break for snacks and beer we returned for a Q&A with Elastic team members: one interesting announcement was that there will be a Elastic(on) in Europe some time this year (if anyone from the Elastic team is reading this please try and avoid a clash with Enterprise Search Europe on October 20th/21st!). Thanks as ever to Yann Cluchey for organising the event and to open source recruiters eSynergySolutions for sponsoring the venue and refreshments.

IntraTeam 2015 – a brief visit

Last week I dropped in on the IntraTeam 2015 conference in Copenhagen, an event focused on intranets with some content on enterprise search. After a rather pleasant evening of Thai food and networking I attended the last day of the event. The keynote speaker was Dave Snowden, who has an amusing and rather curmudgeonly style of presentation, making sure to note the previous presenters he’d disagreed with for their over-reliance on simplistic concepts of knowledge and how the brain works. His talk was however very interesting and introduced the Cynevin framework (a Welsh word which apparently refers to homing sheep!). He also discussed how the rush to digitisation has had a cost in terms of human cognition, how the concept of an intranet will soon disappear (a brave assertion at an intranet conference) and how future systems should perhaps use storytelling metaphors – with some great examples of how collecting these micro-narratives from employees and others can produce extremely rapid feedback on the health of a business.

Andreas Hallgren of Chalmers University showed the evolution of their site-wide search facility, now based on Apache Solr. Unsurprisingly one of the main problems was determining who ‘owns’ search in their organisation: at least now they have a staff member who dedicates 25% of their time to improving search. He had some interesting points about the seasonality of academic searches and how analytics can be used to ‘measure more, guess less’. I was up next talking about Search Turned Upside Down, using a similar set of slides to this one: thanks to all who came and asked some great questions.

Next was Helen Lippell who I have heard speak before on how to get Enterprise Search right – Helen had some great anecdotes and guidance for an attentive audience. Ed Dale followed with five tips for great search: index the right content, optimise this content, measure search, make a great UI and listen to your users – I can only agree! He also characterised the different kinds of content including the worrying ‘content we think we have but we don’t’. The last presentation I attended was by Anders Quitzau of IBM on their fascinating Watson technology: sadly this was a rather marketing-heavy set of slides, with plenty of newly minted buzzwords such as Cognitive Computing and very little useful detail.

Thanks to Kurt Kragh Sorenson and Kristian Norling for inviting me to speak and attend the conference, next time I hope to see a little more of the event!

Lucene/Solr London User Group – Alfresco & Datastax

We had another London user group Meetup last week, hosted by Reed.co.uk who also provided some tasty pizza – eaten under the ‘Love Mondays’ sign from their adverts, which now lives in their boardroom! A few new faces this time and a couple of great talks from two companies who have incorporated Solr into their platforms.

First up was Andy Hind, a founding developer of document management company Alfresco, who told us all about how they originally based their search capability on Lucene 2.4, then moved to Solr 4.4 and most recently version 4.9.1. Using Solr they have implemented often complex security requirements (originally using a PostFilter as Erik Hatcher describes and more recently in the query itself), structured queries (using Phrase and SpanQueries) and their own domain specific query language (DSL) – they can support SQL-like, Lucene and Google-like queries by passing them through parsers based on ANTLR to be served either by the search engine or whatever relational database Alfresco is using. The move to a recent version of Solr has allowed the most recent release of Alfresco to support various modern search features (facets, spelling suggestions etc.) but Andy did mention that so far they are not using SolrCloud for scaling, preferring to manage this themselves.

Next up was Sergio Bossa of Datastax, talking about how their Datastax Enterprise (DSE) product incorporates Solr searching within an Apache Cassandra cluster. Sergio has previously spoken at our Cambridge search meetup on a very similar subject, so I won’t repeat myself here, but the key point is that Solr lives directly on top of the Cassandra cluster, so you don’t have to worry about it at all – search features are directly available from the Cassandra APIs. Like Alfresco, this is an alternative to SolrCloud (assuming you also need a NoSQL database of course!).

Thanks again to Alex Rice for hosting the Meetup, to both our speakers and to all who came – we’ll return soon! In the meantime you may want to check out a few events coming later this year: Berlin Buzzwords, ApacheCon Europe and Lucene/Solr Revolution.

Tags: , , , ,

Posted in Technical, events

February 16th, 2015

No Comments »

Elasticsearch London Meetup: Templates, easy log search & lead generation

After a long day at a Real Time Analytics event (of which more later) I dropped into the Elasticsearch London User Group, hosted by Red Badger and provided with a ridiculously huge amount of pizza (I have a theory that you’ll be able to spot an Elasticsearch developer in a few years by the size of their pizza-filled belly).

First up was Reuben Sutton of Artirix, describing how his team had moved away from the Elasticsearch Ruby libraries (which can be very slow, mainly due to the time taken to decode/encode data as JSON) towards the relatively new Mustache templating framework. This has allowed them to remove anything complex to do with search from their UI code, although they have had some trouble with Mustache’s support for partial templates. They found documentation was somewhat lacking, but they have contributed some improvements to this.

Next was David Laing of CityIndex describing Logsearch, a powerful way to spin up clusters of ELK (Elasticsearch+Logstash+Kibana) servers for log analysis. Based on the BOSH toolchain and open sourced, this allows CityIndex to create clusters in minutes for handling large amounts of data (they are currently processing 50GB of logs every day). David showed how the system is resilient to server failure and will automatically ‘resurrect’ failed nodes, and interestingly how this enables them to use Amazon spot pricing at around a tenth of the cost of the more stable AWS offerings. I asked how this powerful system might be used in the general case of Elasticsearch cluster management but David said it is targetted at log processing – but of course according to some everything will soon be a log anyway!

The last talk was by Alex Mitchell and Francois Bouet of Growth Intelligence who provide lead generation services. They explained how they have used Elasticsearch at several points in their data flow – as a data store for the web pages they crawl (storing these in both raw and processed form using multi-fields), for feature generation using the term vector API and to encode simple business rules for particular clients – as well as to power the search features of their website, of course.

A short Q&A with some of the Elasticsearch team followed: we heard that the new Shield security plugin has had some third-party testing (the details of which I suggested are published if possible) and a preview of what might appear in the 2.0 release – further improvements to the aggregrations features including derivatives and anomaly detection sound very useful. A swift drink and natter about the world of search with Mark Harwood and it was time to get the train home. Thanks to all the speakers and of course Yann for organising as ever – see you next time!