Posts Tagged ‘networking’

Enterprise Search & Discovery 2014, Washington DC

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback (there are slides available to conference attendees, I’ll publish them more widely soon, but this talk was about media monitoring, our Luwak library and how we have successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm).

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is far more focused on a business rather than technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ’search sucks, everyone hates search’ theme (before explaining of course that their own solution would suck less) which I’m personally becoming a little tired of – if we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners Lee’s seminal paper [PDF].

Thanks to Information Today, Marydee Ojala and all others concerned for organising the event and making me feel so welcome.

Tags: , , , , , ,

Posted in events

November 12th, 2014

No Comments »

Autumn events roundup – ESS DC, Solr vs Elasticsearch & a new Meetup

It’s looking like a busy Autumn for search events – first, I’m presenting at Enterprise Search & Discovery 2014 in Washington DC on November 5th, talking about ‘Turning Search Upside Down with open source software’. I’ll be describing how we’ve replaced various underperforming, big name closed source search engines with faster & more scalable open source technology, including our own Luwak stored query engine. Do let me know if you’re in DC, I’d be very happy to meet up. The week after this is Lucene Revolution, which sadly we won’t be attending this year, but it is recommended if you’re interested in Lucene and Solr.

Towards the end of November there’s Search Solutions, a great day of presentations about all aspects of search held at the British Computer Society in Covent Garden. This year Tom Mortimer from Flax will be presenting some research we’ve done into performance comparisons between Lucene/Solr and Elasticsearch, and there are also presentations from Thomson Reuters, the British Library, Microsoft, Yahoo! and Google. I highly recommend this event, it’s always worth attending.

We’re also starting a new Meetup in London, a group for users of Apache Lucene/Solr (there’s an Elasticsearch London user group but strangely no equivalent for the other popular stack). Our first event is on November 28th, kindly hosted by Bloomberg (who are no strangers to Lucene/Solr themselves) and featuring Shalin Mangar, a Lucene/Solr committer from Lucidworks who is visiting Europe that week. We’re hoping that we can run these events every few months, but we need help from the community, so if you could talk, sponsor or host the Meetups do let us know.

In December we’ll be holding another Cambridge Search Meetup and will be talking about our work with the European Bioinformatics Institute on the BioSolr project – the date to be confirmed. Busy times!

Cambridge Search Meetup – Elasticsearch Hackday

Last Friday we hosted a hackday featuring Elasticsearch in Cambridge, following a similar event last year focused on Apache Lucene/Solr. Around 20 people attended from organisations working in sectors including analytics, digital music, bioinformatics and e-commerce, and all the Flax team were there as well.

We started with a brief presentation on Elasticsearch and asked around the room for any data collections we might be able to use. Lee from Elasticsearch (the company) had brought collections of UK crime data and the complete works of Shakespeare; we also had several million rows of digital music metadata, Wikipedia edit data for all UK MPs (to follow last year’s theme!) and several years of data describing Premier League football. Unlike our Solr hackday where each team worked on the same general task, this time we split into four different teams who worked on all of the above except the Wikipedia edits. We’d also been provided with a very high-performance Elasticsearch cluster by BigStep for our use, which meant it was very quick to index the above data and start working with it.

By lunchtime (the food was sponsored by Elasticsearch, who also provided stickers, plush ELKs and lollypops – thanks guys!) we had some very basic information about the various datasets – such as which scene in which Shakespeare play has the most characters on stage (the answer is 21 in Richard III), and which football teams seemed to gain the most advantage from playing at home. Note that we had already moved beyond basic search functionality to use Elasticsearch as an analytic platform, answering particular questions, using features such as aggregations.

We continued during the afternoon to develop the various applications and finished with a ’show and tell’. Some of the teams had managed to develop user interfaces for Elasticsearch, the most polished being a clickable Google Map that would show you which types of crime were significantly above and below the national average for the area you selected – unsurprisingly in Cambridge, stolen bicycles were very common! By the end of the day, everyone had gained experience of Elasticsearch, some for the first time. We finished the day, as is traditional, with a swift pint and further networking.

Thanks to Cambridge Business Lounge (a highly recommended co-working space) for the venue, BigStep for hosting and Elasticsearch for sponsoring lunch and providing the swag, and of course to all who attended. We’ll return with a further Cambridge Search Meetup soon!

Cambridge Search Meetup – Knowledge Discovery & Wayfinding

We were lucky enough to have two speakers from Cambridge text mining company Linguamatics at last night’s Meetup. Robin Newton kicked us off with an amusing and idiosyncratic view of the uses and mis-uses of search – starting with the problem that when you have text search software, every problem can look like search might solve it. He gave an example of his recent search for a new job: although matching his skills on paper with a potential employer’s needs is one thing, he also wants to be sure the employer ‘isn’t a crook’! With reference to Tyler Tate’s talks on Information Wayfinding, which in turn quotes urban planner Kevin Lynch, Robin told us how he felt that search ‘journeys’ weren’t always the most efficient way to discover an answer: his assertion was that finding a person who could tell you was more useful. Since even in the most efficient and well-run organisation not all information is held in documents one might agree that finding an ‘expert’ is the best way to get the answers one needs. He finished with a welcome message that informal networking in pubs and cafes (much like our Meetup) helps share a lot of very useful information – and this is how he eventually decided that Linguamatics was going to be a great place to work.

Next was CTO and co-founder of Linguamatics, Dr David Milward, who described his company’s capability in text mining, Natural Language Processing (NLP) and search. He described the challenges of extracting ‘concepts’ from text – how words and acronyms with multiple potential meanings are difficult to parse automatically without contextual knowledge. Linguamatics’ approach has been described as ‘Agile NLP’ and allows the quick development of new patterns for concept extraction. A powerful example he gave was how by specifying a relationship between two entities, in this case one company acquiring another, structured data can be extracted from unstructured text. Other examples focused on the medical and bioscience field (a particular interest of ours at present due to the upcoming BioSolr project) and showed how their software can cluster facts and find connections between disparate pieces of data (‘which X relates to Y via Z’). This process can also be used to generate new facets for searching from free text, including for numeric ranges, and these can even be tailored for different user groups. It’s clear that Linguamatics are experts in this area and David’s talk was of great interest to many in the room, including several from the European Bioinformatics Institute.

We finished with the usual chat, networking and drinks. Thanks to both our speakers – and do let me know if you have a suggestion for a presentation at a future event!

Cambridge Search Meetup – Cassandra & Solr

A sunny evening last night for the latest Cambridge Search Meetup, which featured a couple of talks from Datastax on the highly scalable NoSQL database Apache Cassandra and how it is integrated with Apache Lucene/Solr. Jeremy Hanna started us off with a brief history of the Facebook-incubated Cassandra, which is a fully distributed, highly reliable system used by many including Netflix and Spotify with some customers running thousands of nodes in multiple data centres. Cassandra has its own SQL-like language, CQL3 and some basic collections such as Lists and Maps, but due to its fully distributed nature does lack some traditional features such as JOINs. Datastax themselves are now responsible for most of the ongoing work on Cassandra and offer the usual array of training, support, management services and tools. One common application mentioned was high speed and reliable recording of sensor data, increasingly important now with the rise of the Internet of Things.

After a short break for drinks and snacks (which this time were kindly sponsored by Datastax) Sergio Bossa told us how Solr is integrated with Cassandra, also running in a distributed fashion. Interestingly, this integration doesn’t use the same Zookeeper system as SolrCloud (the standard way to run clusters of Solr servers) but relies instead on Cassandra’s own internal scaling systems, passing data about using ‘gossip‘ between nodes. Zookeeper is not always the easiest thing to get running so an alternative is very interesting! Data can be added to the system over HTTP or the aforementioned CQL3 and after being entered into Cassandra’s tables is subsequently indexed by Solr. Queries can then be made over HTTP as usual. Some work is still necessary to prevent duplication of effort (at present one needs to create data structures in Cassandra and subsequently in Solr).

It was pleasing so see that so much care has been taken with this integration process and also that Datastax offer their Datastax Enterprise Search stack not only free for non-production use, but free to startups. Thanks to Jeremy, Sergio and all who came along and we’ll be back with another Search Meetup soon.

Enterprise Search Europe 2014 day 1 – Decisions, research and a Meetup quiz

This year’s Enterprise Search Europe was held near Victoria train station in London and unfortunately coincided with a two day strike on the London Underground – worrying for the organisers, but apart from a few notable absences it didn’t seem to affect the attendance too much. We started with a keynote from Dale Roberts, whose book on Decision Sourcing inspired a talk about a ‘rational decision making model’. When examining traditional relational database applications Dale said ‘if you peer at it long enough you can see the rows and columns’ and his point was that modern consumer social networking applications don’t exhibit this old pattern – so this is where search application designers should look for inspiration. His co-presenter Rooven Pakkiri said that Enterprise Search should attempt to ‘release the information from inside our heads’, which of course social networking might help with, connecting you with colleagues. I’m not sure that one can easily take lessons learnt from consumer applications and apply them to business use, and some later speakers agreed with me, but this was a high-energy and thought-provoking start.

Next I chaired the Open Source track, where we started with Cedric Ulmer of France Labs, who talked about a search application they built for a consultancy business with around 40 employees. Using Apache Solr, Apache ManifoldCF and their own Datafari open source framework they turned this project around very quickly – interestingly, the end clients needed no training to use the new system, which implies a very well designed UI. Our second talk from Ronald Hobbs of Reed Business International described a project on a much larger scale: 100 million documents, 72 business units and up to 190 queries per second – this was originally served by the FAST ESP engine but they moved to an Apache Solr system, replacing the FAST processing pipeline with Search Technologies Aspire project. His five steps for an effective migration (Prepare, Get the right tools, Get the right team, Migrate in chunks, Clean up) I can only agree with from our own experience of such projects, including one from FAST ESP to Solr. I was amused by his description of the Apache Zookeeper project as ‘a bipolar manic depressive’, although it seemed this was eventually overcome with a successful deployment on Amazon EC2. Next was Galina Hinova of Intrafind on a aftersales search application for MAN Truck and Bus – again at serious scale (MAN have around 1 billion vehicles in existence with 100-150 documents related to each). Interestingly the Euro6 regulations for emissions and standardized EU terms for automobile parts were direct drivers of the project, with Apache Lucene as the base technology. No longer is open source search just for small-scale projects it seems!

After a short break during which I chatted to John Newton, founder of Documentum Alfresco, and his team we returned to hear Dan Jackson give a description of how UCL had improved their website search – with a chaotic mix of low quality content and an ‘awful’ content management system, the challenges were myriad but with the help of experts such as our associate Tony Russell-Rose they have made significant improvements. Next was what was to prove a very popular talk from Nick Brown of AstraZeneca on a huge, well funded project to build applications to support research and development – again, this was at large scale with 75 million documents (including ‘all the patents and all the research papers’). The key here was their creation of many well-targeted ‘apps’ to enable particular uses of the Sinequa search engine they chose for the back end, including mobile apps to help find others in the company (or external to it) who are also working on a particular drug or disease. This presentation showed just what can be achieved if companies really understand the potential of search technology – knowledge sharing and discovery of previously unknown information.

After a short drinks reception we retired to a nearby pub for the combined Cambridge and London Search Meetup – I’d prepared a short quiz (feel free to have a go!) which was won by Tony Russell-Rose’s team. Networking and chatting continued long into the evening, with some people from the wider UK search community also attending.

To be continued! You can see most of the slides here.

Cambridge Search Meetup – six degrees of ontology and Elasticsearching products

Last Wednesday evening the Cambridge Search Meetup was held with too very different talks – we started with Zoë Rose, an information architect who has lent her expertise to Proquest, the BBC and now the UK Government. She gave an engaging talk on ontologies, showing how they can be useful for describing things that don’t easily fit into traditional taxonomies and how they can even be used for connecting Emperor Hirohito of Japan to Kevin Bacon in less than six steps. Along the way we learnt about sea creatures that lose their spines, Zoë’s very Australian dislike of jellyfish and other stinging sea dwellers and her own self-cleaning fish tank at home.

As search developers, we’re often asked to work with both taxonomies and ontologies and the challenge is how to represent them in a flat, document-focused index – perhaps ontologies are better represented by linked data stores such as provided by Apache Marmotta.

Next was Jurgen Van Gael of Rangespan, a company that provide an easy way for retailers to expand their online inventory beyond what is available in brick-and-mortar stores (customers include Tesco, Argos and Staples). Jurgen described how product data is gathered into MongoDB and MySQL databases, processed and cleaned up on a Apache Hadoop cluster and finally indexed using Elasticsearch to provide a search application for Rangespan’s customers. Although indexing of 50 million items takes only 75 minutes, most of the source data is only updated daily. Jurgen described how heirarchical facets are available and also how users may create ’shortlists’ of products they may be interested in – which are stored directly in Elasticsearch itself, acting as a simple NoSQL database. For me one of the interesting points from his talk was why Elasticsearch was chosen as a technology – it was tried during a hack day, found to be very easy to scale and to get started with and then quickly became a significant part of their technology stack. Some years ago implementing this kind of product search over tens of millions of items would have been a significant challenge (not to mention very expensive) – with Elasticsearch and other open source software this is simply no longer the case.

Networking and socialising continued into the evening, with live music in the pub downstairs. Thanks to everyone who came and in particular our two speakers. We’ll be back with another Meetup soon!

Search events for Autumn 2013

As usual there’s several interesting events on the horizon this autumn: first up is an London Elasticsearch User Group Meetup on Monday September 9th, which will probably be shortly followed by a Cambridge Search Meetup once I have confirmed a venue – we’re hoping to include a talk on Solr & Hadoop.

November features Lucene Revolution, which this year is in Dublin – there are two days of training on 4th & 5th November followed by the conference itself on the 6th and 7th. We’re hoping to talk at this event if accepted (if you like, you can help us out by voting for our proposed topic “Turning Search upside down: using Lucene for very fast stored queries” which is based on some of our work for clients in the media monitoring sector). No fewer than four of the Flax team will be attending so we hope to catch up with you over a pint of Guinness there!

At the other end of November on the 27th is Search Solutions in London, a day of talks on all aspects of search hosted by the British Computer Society Information Retrieval Specialist Group. I’ll be running a training session the day before on open source search.

Do let us know of any other events in the area of search – we’re as ever very happy to publicise them.

Search Meetups return with news of two search books

Last night the London Search Meetup returned after a year’s absence: it’s great to see it back. The venue was at St Pancras with the room overlooking Eurostar trains and statues, inside the beautifully restored station building.

The speakers were both there to talk about their recent books: prolific author Martin White of Intranet Focus has written a book on Enterprise Search with the strapline ‘Enhancing Business Performance’. Martin has decades of experience in the sector, an enviable collection of war stories from inside the enterprise and was as ever an engaging speaker.

Next up were Tony Russell-Rose and Tyler Tate to talk about their new book which focuses on the user experience of search. ‘Designing the Search Experience’ promises to be a rich resource on how, why and where people use search and how this impacts the design of user interfaces.

The evening ended with some lively discussion and a promise that after its long absence this Meetup will now be happening on a more regular basis. We’re also running our own Cambridge search meetup – the next event is on February 21st where we’ll be hearing about web crawling and scraping. Another date for your diary is the Enterprise Search Europe conference on May 15th & 16th this year – the programme has just been published and features speakers from Ernst & Young and Oracle. I’ll also be running a workshop the day before the conference on Getting the Best from Open Source Search.

Tags: , , , ,

Posted in events

February 12th, 2013

No Comments »

Search Solutions 2012 – a review

Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.

Milad Shokouhi of Microsoft Research started us off showing us how he’s worked on query trend analysis for Bing: he showed us how some queries are regular, some spike and go and some spike and remain – and how these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues which is worth a look.

Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems that users are “very satisfied” usually do the same things, such as regular user testing and employing a specialist internal search team.

Stella Dextre Clark talked next about a new ISO standard for thesauri, taxonomies and their interopability with other vocabularies – some great points on the need for thesauri to break down language barriers, help retrieval in enterprise situations where techniques such as PageRank aren’t so useful and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine both for KDE Linux and currently the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.

Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.

The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as it had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.