Posts Tagged ‘networking’

Cambridge Search Meetup – six degrees of ontology and Elasticsearching products

Last Wednesday evening the Cambridge Search Meetup was held with two very different talks – we started with Zoë Rose, an information architect who has lent her expertise to Proquest, the BBC and now the UK Government. She gave an engaging talk on ontologies, showing how they can be useful for describing things that don’t easily fit into traditional taxonomies and how they can even be used for connecting Emperor Hirohito of Japan to Kevin Bacon in fewer than six steps. Along the way we learnt about sea creatures that lose their spines, Zoë’s very Australian dislike of jellyfish and other stinging sea dwellers and her own self-cleaning fish tank at home.

As search developers, we’re often asked to work with both taxonomies and ontologies and the challenge is how to represent them in a flat, document-focused index – perhaps ontologies are better represented by linked data stores such as that provided by Apache Marmotta.

Next was Jurgen Van Gael of Rangespan, a company that provides an easy way for retailers to expand their online inventory beyond what is available in brick-and-mortar stores (customers include Tesco, Argos and Staples). Jurgen described how product data is gathered into MongoDB and MySQL databases, processed and cleaned up on an Apache Hadoop cluster and finally indexed using Elasticsearch to provide a search application for Rangespan’s customers. Although indexing of 50 million items takes only 75 minutes, most of the source data is only updated daily. Jurgen described how hierarchical facets are available and also how users may create ‘shortlists’ of products they may be interested in – which are stored directly in Elasticsearch itself, acting as a simple NoSQL database. For me one of the interesting points from his talk was why Elasticsearch was chosen as a technology – it was tried during a hack day, found to be very easy to scale and to get started with and then quickly became a significant part of their technology stack. Some years ago implementing this kind of product search over tens of millions of items would have been a significant challenge (not to mention very expensive) – with Elasticsearch and other open source software this is simply no longer the case.
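The talk didn't go into how the hierarchical facets were built, so the following is only a minimal sketch of one common approach, not Rangespan's implementation: index every ancestor path of a product's category, so that a count is available at each level of the tree. The function names, the `/` separator and the sample products are all assumptions made for illustration.

```python
from collections import Counter


def ancestor_paths(category, sep="/"):
    """Return every prefix of a category path, shallowest first."""
    parts = [p for p in category.split(sep) if p]
    return [sep.join(parts[: i + 1]) for i in range(len(parts))]


def facet_counts(products):
    """Count products under every level of the category tree."""
    counts = Counter()
    for product in products:
        counts.update(ancestor_paths(product["category"]))
    return counts


# Hypothetical sample data, not taken from the talk.
products = [
    {"name": "Kettle", "category": "Home/Kitchen/Kettles"},
    {"name": "Toaster", "category": "Home/Kitchen/Toasters"},
    {"name": "Desk lamp", "category": "Home/Lighting"},
]

counts = facet_counts(products)
```

Here `counts["Home/Kitchen"]` covers both kitchen products, so a user interface can show a count beside each branch and drill down by filtering on the longer path.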

Networking and socialising continued into the evening, with live music in the pub downstairs. Thanks to everyone who came and in particular our two speakers. We’ll be back with another Meetup soon!

Search events for Autumn 2013

As usual there are several interesting events on the horizon this autumn: first up is a London Elasticsearch User Group Meetup on Monday September 9th, which will probably be shortly followed by a Cambridge Search Meetup once I have confirmed a venue – we’re hoping to include a talk on Solr & Hadoop.

November features Lucene Revolution, which this year is in Dublin – there are two days of training on 4th & 5th November followed by the conference itself on the 6th and 7th. We’re hoping to talk at this event if accepted (if you like, you can help us out by voting for our proposed topic “Turning Search upside down: using Lucene for very fast stored queries” which is based on some of our work for clients in the media monitoring sector). No fewer than four of the Flax team will be attending so we hope to catch up with you over a pint of Guinness there!

At the other end of November on the 27th is Search Solutions in London, a day of talks on all aspects of search hosted by the British Computer Society Information Retrieval Specialist Group. I’ll be running a training session the day before on open source search.

Do let us know of any other events in the area of search – we’re as ever very happy to publicise them.

Search Meetups return with news of two search books

Last night the London Search Meetup returned after a year’s absence: it’s great to see it back. The venue was at St Pancras with the room overlooking Eurostar trains and statues, inside the beautifully restored station building.

The speakers were both there to talk about their recent books: prolific author Martin White of Intranet Focus has written a book on Enterprise Search with the strapline ‘Enhancing Business Performance’. Martin has decades of experience in the sector, an enviable collection of war stories from inside the enterprise and was as ever an engaging speaker.

Next up were Tony Russell-Rose and Tyler Tate to talk about their new book which focuses on the user experience of search. ‘Designing the Search Experience’ promises to be a rich resource on how, why and where people use search and how this impacts the design of user interfaces.

The evening ended with some lively discussion and a promise that after its long absence this Meetup will now be happening on a more regular basis. We’re also running our own Cambridge search meetup – the next event is on February 21st where we’ll be hearing about web crawling and scraping. Another date for your diary is the Enterprise Search Europe conference on May 15th & 16th this year – the programme has just been published and features speakers from Ernst & Young and Oracle. I’ll also be running a workshop the day before the conference on Getting the Best from Open Source Search.


Posted in events

February 12th, 2013


Search Solutions 2012 – a review

Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.

Milad Shokouhi of Microsoft Research started us off showing us how he’s worked on query trend analysis for Bing: he showed us how some queries are regular, some spike and go and some spike and remain – and how these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues which is worth a look.

Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems that users are “very satisfied” with usually do the same things, such as regular user testing and employing a specialist internal search team.

Stella Dextre Clark talked next about a new ISO standard for thesauri, taxonomies and their interoperability with other vocabularies – some great points on the need for thesauri to break down language barriers, help retrieval in enterprise situations where techniques such as PageRank aren’t so useful and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine both for KDE Linux and currently the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.

Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.

The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as they had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.

Cambridge Search Meetup – Search for publication success and low-cost apps

After a short break the Cambridge Search Meetup returned last night with our usual mix of presentations, questions, networking, beer and snacks. We had a few issues with the projector and cables (one of these is on the shopping list for next time) so thanks to both presenters and audience for their patience!

First up was Liang Shen with a description of Journal Selector, a system for helping those publishing academic papers to find the correct journals to approach. The system allows one to copy and paste a chunk of a paper to a website and find which journals best match the subject matter, based on what they have published in the past. Running on the Amazon EC2 cloud the service indexes journals from feeds, HTML webpages and other sources, processes and stores this data in Amazon’s Hadoop-compatible database, indexes it with Apache Solr and then presents the results via the Drupal CMS. The results are impressive, allowing users to see exactly on what basis the system has recommended a journal to approach. You can see the presentation slides here.

Next was Rich Marr, who bravely offered to live-code a demonstration of his low-cost prototyping methodology for startups needing both NoSQL data storage and search across this data. In only 20 lines or so of code he showed us how to use Node.js to build a simple server that could accept messages (over Telnet, although HTTP or even IMAP would be as easy), store them in a CouchDB database and index them for searching (using a different message) with Elasticsearch. Rich’s demo prompted a lively discussion of how commoditized and componentized search technology is becoming, with open source components that allow one to build a prototype search engine in minutes.
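Rich's demo was written in Node.js against real CouchDB and Elasticsearch servers, and I don't have his code. Purely to illustrate the shape of the idea (one kind of message stores and indexes, another searches), here is a rough Python sketch in which a dictionary stands in for the document store and an in-memory inverted index stands in for the search engine. The `PUT` and `FIND` commands and the class name are my own inventions, not anything from the demo.

```python
import re
from collections import defaultdict


class MessageService:
    """Toy line-based service: PUT stores and indexes text, FIND searches it."""

    def __init__(self):
        self.docs = {}                 # stand-in for the document store
        self.index = defaultdict(set)  # stand-in for the search index: term -> ids
        self.next_id = 1

    def handle(self, line):
        """Dispatch one incoming line to the store or search handler."""
        command, _, rest = line.strip().partition(" ")
        if command == "PUT":
            return self._store(rest)
        if command == "FIND":
            return self._search(rest)
        return "ERR unknown command"

    def _store(self, text):
        doc_id = self.next_id
        self.next_id += 1
        self.docs[doc_id] = text
        # Index each lower-cased word against the new document id.
        for term in re.findall(r"\w+", text.lower()):
            self.index[term].add(doc_id)
        return f"OK {doc_id}"

    def _search(self, query):
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return "ERR empty query"
        # A document matches only if it contains every query term.
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return "\n".join(f"{i}: {self.docs[i]}" for i in sorted(hits)) or "NONE"
```

Feeding it `PUT hello search world` and then `FIND world` returns the stored message. In a real prototype the two stand-ins would be replaced by calls to the database and the search engine, and `handle` would be wired to a socket.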

Thanks to both our speakers – and the Meetups continue, with Rich Marr’s own London Open Source Search Social meeting on Tuesday 23rd October, and in Cambridge the Data Insights Meetup where I’ll be talking on November 1st.

An open day on open source search from Sirius & Flax

We spent Friday at the riverside offices of Sirius Corporation, our support partners, for the first and hopefully not the last of their Open Days on open source enterprise search. We were lucky to have Mike Davis, a very well known and highly experienced analyst, to open the talks – despite suffering from flu he gave an engaging talk on why open source enterprise search software should be your first port of call, and how you should only consider closed source options when you need particular features they provide.

We then gave a quick Introduction to Open Source Search, detailing the various packages available (from Apache Lucene/Solr to Xapian and Sphinx) and showing a quick Solr-powered demo we’d built to search some pages from the BBC Music website. Using the programmer’s first choice for an example query (the ever reliable ‘foo*’) we discovered the wonderfully named Original Rabbit Foot Spasm Band – which interestingly you can’t find via the BBC’s own site search engine due to lack of wildcard support.

Andrew Savory, Sirius’ CTO and Apache Foundation member, then gave a presentation on what an Apache project actually is and how best to engage with an open source community – very useful for those considering open source for the first time. The morning finished with a delicious barbeque on the riverbank provided by Sirius. We thought the event went very well and we’d love to confirm the rumour that this will become a regular event. Thanks to all at Sirius for organising and hosting the day and we look forward to returning.

Search Meetup Cambridge – Challenges of Unstructured Data

Another Cambridge Search Meetup this week, with two speakers on unstructured data, plus the usual networking, beer and snacks. We started with Dean Yearsley of Pingar, who talked about their API and bravely attempted a live demo of it; amongst other things it has facilities for entity extraction in multiple languages including English, Chinese and Japanese. The Pingar system is written in .Net and thus unsurprisingly plays well with Sharepoint: Dean demonstrated it automatically providing extra metadata for Sharepoint items, especially useful if a new column has been added to a Sharepoint store, as it would be tedious for operators to have to add data for this column to each item manually.

Jordan Hrycaj of 7Safe, recently acquired by PA Consulting, was up next to talk about what he described as ‘ad-hoc’ search – for use in digital forensics or digital discovery applications. The application he described can be used to search the hard disks of suspect PCs or servers for information such as credit card numbers extremely quickly, working at a low level to avoid leaving any impression on the data (i.e., no file timestamps are altered) and usually working on live systems. This system is command line based, tiny in size and portable across operating systems and is an impressive way to cut down the likely candidates for a data security breach. It was fascinating to hear about a way to search that doesn’t depend on indexing, and the compromises made for performance reasons (for example, regular expressions can be used but without wildcards).
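To give a flavour of index-free searching, here is a small Python sketch that scans a raw byte buffer for candidate card numbers. It is not 7Safe's tool and makes no attempt at forensic soundness. The pattern uses only bounded repetition, which is one possible reading of the "no wildcards" restriction, and the Luhn checksum filter is my own addition to cut down false positives.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens,
# and not embedded in a longer run of digits.
CANDIDATE = re.compile(rb"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_ok(digits):
    """Return True if a string of decimal digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(data):
    """Yield (offset, digits) for each Luhn-valid candidate in a bytes buffer."""
    for match in CANDIDATE.finditer(data):
        digits = re.sub(rb"[ -]", b"", match.group()).decode("ascii")
        if luhn_ok(digits):
            yield match.start(), digits
```

Calling `list(find_card_numbers(buffer))` on a chunk read from a disk image gives byte offsets that an investigator could then examine by hand. A real tool would also need to handle candidates that straddle chunk boundaries.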

Thanks to both speakers and to all who came to hear them. We already have some more talks lined up so we expect the next Meetup to be sooner rather than later!

Search events for 2012 – the first crop

Details of search events in 2012 are beginning to appear already. Here are a few to start with:

  • 1-5 April 2012 – European Conference on Information Retrieval (ECIR) in Barcelona, Spain. An academic conference featuring new developments in IR.
  • 7-10 May 2012 – Lucid Imagination’s Lucene Revolution in Boston, USA. The largest conference on open source search – this event has a great buzz as the Lucene/Solr community continues to grow.
  • 30/31 May 2012 – Enterprise Search Europe in London, after a successful first event last year. Great for those planning or working on enterprise search projects.

More to come as we hear about them – we’ll also be running another Cambridge Search Meetup soon.


Posted in events

January 25th, 2012


Search Solutions 2011 review

I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.

The day started with a talk by John Tait on the challenges of patent search where different units are concerned – where for example a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown including a postcode-based browser. The system is based on Apache Solr and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.
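I don't know how the max.recall plugin works internally, so the following is just a sketch of the general idea behind unit-aware matching: convert every temperature found in the text to a single canonical unit (Kelvin here) before comparing. The regular expression, the function names and the tolerance value are all assumptions of mine.

```python
import re

# A number followed by one of three unit symbols, e.g. "200°C", "392 °F", "473.15 K".
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|°F|K)\b")


def to_kelvin(value, unit):
    """Convert a temperature in °C, °F or K to Kelvin."""
    if unit == "°C":
        return value + 273.15
    if unit == "°F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == "K":
        return value
    raise ValueError(f"unsupported unit: {unit}")


def temperatures_in_kelvin(text):
    """Extract every temperature mentioned in the text, normalised to Kelvin."""
    return [to_kelvin(float(v), u) for v, u in TEMPERATURE.findall(text)]


def matches(query_kelvin, text, tolerance=0.5):
    """True if any temperature in the text is within tolerance of the query."""
    return any(abs(k - query_kelvin) <= tolerance
               for k in temperatures_in_kelvin(text))
```

With this in place a query for 200°C, converted via `to_kelvin(200.0, "°C")`, matches a document that mentions "392 °F" or "473.15 K". In a search engine the normalised value would more likely be stored in a numeric field at index time and searched with a range query.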

Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.

After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.

The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently). Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV program recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked amongst other things what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.

It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.

The Fall and rise of search in a world of Big Data – part 2

The theme of Big Data continued at the next conference I attended, the first Enterprise Search Europe held in London. There was a good mix of presentations ranging from the academic to the practical, my favourite probably being Martin Belam and colleague’s talk about using Solr to dynamically generate content for the new Guardian Books site. I was lucky enough to be able to talk about the real business benefits of open source search along with one of our customers, Stephen Wicks, CTO of Gorkana Group, which drew some interesting questions. We also ran a combined Meetup on the Monday evening, combining Enterprise Search Cambridge with Enterprise Search London.

There did seem to be a rather negative spin on search from many presenters – saying that search technology is misunderstood, more costly than expected, rarely works and hasn’t seen much recent innovation. Some of this is true – but I see this as an opportunity rather than a problem. There is more focus on the world of search now than before due to some high-profile acquisitions; people are questioning the value and capability of search technology. Those of us working at the cutting edge, delivering real working solutions, should perhaps take this opportunity to say that yes, it can be done, at a sensible cost, and it can deliver real business benefit. Perhaps as we move further into the world of Big Data we’ll realise the true value of effective search.


Posted in events

October 31st, 2011
