Archive for the ‘events’ Category

ISKO UK – Taming the News Beast

I spent yesterday afternoon at UCL for ISKO UK’s event on Taming the News Beast – I’m not sure if we found out how to tame it but we certainly heard how to festoon it with metadata and lock it up in a nice secure ontology. There were around 90 people attending from news, content, technology and academic organisations, including quite a few young journalism students visiting London from Missouri.

The first talk was by Matt Shearer of BBC News Labs who described how they are working on automatically extracting entities from video/audio content (including verbatim transcripts, contributors using face/voice recognition, objects using audio/image recognition, topics, actions and non-verbal events including clapping). Their prototype ‘Juicer’ extractor currently works with around 680,000 source items and applies 5.7 million tags – which represents around 9 man years for a manual tagger. They are using Stanford NLP and DBpedia heavily, as well as an internal BBC project ‘Mango’ – I hope that some of the software they are developing is eventually open sourced as after all this is a publically-funded broadcaster. His colleague Jeremy Tarling was next and described a News Storyline concept they had been working on a new basis for the BBC News website (which apparently hasn’t changed much in 17 years, and still depends on a lot of manual tagging by journalists). The central concept of a storyline (e.g. ‘US spy scandal’) can form a knowledge graph, linked to events (‘Snowden leaves airport’), videos, ‘explainer’ stories, background items etc. Topics can be used to link storylines together. This was a fascinating idea, well explained and something other news organisations should certainly take note of.

Next was Rob Corrao of LAC Group describing how they had helped ABC News revolutionize their existing video library which contains over 2 million assets. They streamlined the digitization process, moved little-used analogue assets out of expensive physical storage, re-organised teams and shift patterns and created a portal application to ease access to the new ‘video library as a service’. There was a focus on deep reviews of existing behaviour and a pragmatic approach to what did and didn’t need to be digitized. This was a talk more about process and management rather than technology but the numbers were impressive: at the end of the project they were handling twice the volume with half the people.

Ian Roberts from the University of Sheffield then described AnnoMarket, a cloud-based market platform for text analytics, which wraps the rather over-complex open source GATE project in an API with easy scalability. As they have focused on precision over recall, AnnoMarket beats other cloud-based NLP services such as OpenCalais and TextRazor in terms of accuracy, and can process impressive volumes of documents (10 million in a few hours was quoted). They have developed custom pipelines for news, biomedical and Twitter content with the former linked into the Press Association’s ontology (PA is a partner in AnnoMarket). For those wanting to carry out entity extraction and similar processes on large volumes of content at low cost AnnoMarket certainly looks attractive.

Next was Pete Sowerbutts of PA on the prototype interface he had helped develop for tagging all of PA’s 3000 daily news stories with entity information. I hadn’t known how influential PA is in the UK news sector – apparently 30% of all UK news is a direct copy of a PA feed and they estimate 70% is influenced by PA’s content. The UI showed how entities that have been automatically extracted can be easily confirmed by PA’s staff, allowing for confirmation that the right entity is being used (the example being Chris Evans who could be both a UK MP, a television personality and an American actor). One would assume the extractor produces some kind of confidence measure which begs the question whether every single entity must be manually confirmed – but then again, PA must retain their reputation for high quality.

The event finished with a brief open discussion featuring some of the speakers on an informal panel, followed by networking over drinks and snacks. Thanks to all at ISKO especially Helen Lippell for organising what proved to be a very interesting day.

London Search Meetup – Serious Solr at Bloomberg & Elasticsearch 1.0

The financial information service Bloomberg hosted last Friday’s London Search Meetup in their offices on Finsbury Square – the venue had to be seen to be believed, furnished as it is with neon, chrome, modern art and fishtanks. A slight step up from the usual room above a pub! The first presenter was Ramkumar Aiyengar of Bloomberg on their new search system, accessed via the Bloomberg terminal (as it seems is everything else – Ramkumar even opened his presentation file and turned off notifications from his desk phone from within this application).

Make no mistake, Bloomberg’s requirements are significant: 900,000 new stories from 75,000 sources and 8 million manual searches every day with another 350,000 stored searches running automatically. Some of these stored searches are Boolean expressions with up to 20,000 characters and the source data is also enhanced with keywords from a list of over a million tags. Access Control Lists (ACLs) for security and over 40 languages are also supported, with new stories becoming searchable within 100ms. What is impressive is that these requirements are addressed using the open source Apache Lucene/Solr engine running 256 index shards, replicated 4 times for a total of 1024 cores, on a farm of 32 servers each with 256GB of RAM. It’s interesting to wonder if many closed source search engines could cope at all at this scale, and slightly scary to think how much it might cost!

Ramkumar explained how achieving this level of performance had led them to expose (and help to fix) quite a few previously unknown race conditions in Solr. His team had also found innovative ways to cope with such a large number of tags – each has a confidence value, say 70%, and this can be used to perform a kind of TF/IDF ranking by effectively adding 70 copies of the tag to a document. They have also developed an XML-based query parser for their in-house query syntax (althought in the future the JSON format may be used) and have contributed code back to Solr (for those interested, Bloomberg have contributed to SOLR-839 and are also looking at SOLR-4351).

For the monitoring requirement, we were very pleased to hear they are building an application based on our own Luwak stored query engine, which we developed for just this sort of high-performance application – we’ll be helping out where we can. Other future plans include relevance improvements, machine translation, entity search and connecting to some of the other huge search indexes running at Bloomberg, some on the petabyte scale.

Next up was Mark Harwood of Elasticsearch with an introduction to some of the features in version 1.0 and above. I’d been lucky enough to see Mark talk about some of these features a few weeks before so I won’t repeat myself here, but suffice it to say he again demonstrated the impressive new Aggregrations feature and raised the interesting possibility of market analysis by aggregating over a set of logged queries – identifying demand from what people are searching for.

Thanks to Bloomberg, Ramkumar, Mark and Tyler Tate for a fascinating evening – we also had a chance to remind attendees of the combined London & Cambridge Search Meetup on April 29th to coincide with the Enterprise Search Europe conference (note the discount code!).

Cambridge Search Meetup – six degrees of ontology and Elasticsearching products

Last Wednesday evening the Cambridge Search Meetup was held with too very different talks – we started with Zoë Rose, an information architect who has lent her expertise to Proquest, the BBC and now the UK Government. She gave an engaging talk on ontologies, showing how they can be useful for describing things that don’t easily fit into traditional taxonomies and how they can even be used for connecting Emperor Hirohito of Japan to Kevin Bacon in less than six steps. Along the way we learnt about sea creatures that lose their spines, Zoë’s very Australian dislike of jellyfish and other stinging sea dwellers and her own self-cleaning fish tank at home.

As search developers, we’re often asked to work with both taxonomies and ontologies and the challenge is how to represent them in a flat, document-focused index – perhaps ontologies are better represented by linked data stores such as provided by Apache Marmotta.

Next was Jurgen Van Gael of Rangespan, a company that provide an easy way for retailers to expand their online inventory beyond what is available in brick-and-mortar stores (customers include Tesco, Argos and Staples). Jurgen described how product data is gathered into MongoDB and MySQL databases, processed and cleaned up on a Apache Hadoop cluster and finally indexed using Elasticsearch to provide a search application for Rangespan’s customers. Although indexing of 50 million items takes only 75 minutes, most of the source data is only updated daily. Jurgen described how heirarchical facets are available and also how users may create ’shortlists’ of products they may be interested in – which are stored directly in Elasticsearch itself, acting as a simple NoSQL database. For me one of the interesting points from his talk was why Elasticsearch was chosen as a technology – it was tried during a hack day, found to be very easy to scale and to get started with and then quickly became a significant part of their technology stack. Some years ago implementing this kind of product search over tens of millions of items would have been a significant challenge (not to mention very expensive) – with Elasticsearch and other open source software this is simply no longer the case.

Networking and socialising continued into the evening, with live music in the pub downstairs. Thanks to everyone who came and in particular our two speakers. We’ll be back with another Meetup soon!

Convergence and collisions in Enterprise Search

At the end of next month I’ll be at Enterprise Search Europe (I’m on the programme committee and help with the open source track) and the opening keynote this year is from Dale Roberts, author of the book Decision Sourcing. Dale will be talking about how Social, Big Data, Analytics and Enterprise Search are on a collision course and business leaders ignore these four themes at their peril.

So I wondered if we could see how in practical terms one might build systems based on these four themes. There are technical and logistical challenges of course (not least convincing someone to pay for the effort) but it’s worth exploring nonetheless.

Social in a business context can mean many things: social media is inherently noisy (and as far as I can see mostly cats) but when social tools are used within a business they can be a great way to encourage collaboration. We ourselves have added social features to search applications – user tagging of search results for example, to improve relevance for future searches and to help with de-duplication. Much has been made of the idea of finding not just relevant documents, but the subject matter experts that may have written them, or just other people in your organisation who are interested in the same subject. From a technical point of view none of this is particularly hard – you just have to add these social signals to your index and surface them in some intuitive way – but getting a high enough percentage of users to contribute to shared discussions and participate in tagging can be difficult.

Big Data is an overused term – but in a business context people usually apply it to very large collections of log files or other data showing how your customers are interacting with your business. A lot of search engine experts will tell you that Big Data isn’t always that ‘big’ – we’ve been dealing with collections of hundreds of millions or even billions of indexed items for many years now, the trick is scaling your solution appropriately (not just in technical terms, but in an economic way, as linearly as possible). If you’ve got a few million items, I’m sorry but you haven’t got Big Data, you’ve just got some data.

I’ve always been unsure of the benefits of search Analytics but I’m beginning to change my mind, having seen a some very impressive demos recently. Search engines have always counted things; the clever bit is allowing for queries that can surface unusual or interesting information, and using modern visualisation techniques to show this. Knowing the most popular search term may not be as important as spotting an unexpected one.

So we’ve indexed our data including tags, personnel records, internal chatrooms; put them all onto a elastically scalable platform and built some intuitive and useful interfaces to search and analyze our data. I’m pretty sure you could do all this with the open source technologies we have today (including Scrapy, Apache Lucene/Solr, Elasticsearch, Apache Hadoop, Redis, Logstash, Kibana, JQuery, Dropwizard, Python and Java). This isn’t the whole story though: you’d need a cross-disciplinary team within your organisation with the ability to gather user requirements and drive adoption, a suitable budget for prototyping, development and ongoing support and refinements to the system and a vision encompassing the benefits that it would bring your business. Not an inconsiderable challenge!

What questions should we be able to ask the system? I’ll leave that as an exercise for the reader.

See you in April! If you’d like a 20% discount on registration use the code HULL20. We’ll also be running an evening Meetup on Tuesday 29th April open to both conference attendees and others.

ElasticSearch London Meetup – a busy and interesting evening!

I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest there is in ElasticSearch these days and the event didn’t disappoint, with some fascinating talks.

Hugo Pickford-Wardle from Rely Consultancy kicked off with a discussion about how ElasticSearch allows for rapid ‘hard prototyping’ – a way to very quickly test the feasibility of a business idea, and/or to demonstrate previously impossible functionality using open source software. His talk focussed on how a search engine can help to surface content from previously unconnected and inaccessible ‘data islands’ and can help promote re-use and repurposing of the data, and can lead clients to understand the value of committing to funding further development. Examples included a new search over planning applications for Westminster City Council. Interestingly, Hugo mentioned that during one project ElasticSearch was found to be 10 times faster than the closed source (and very expensive) Autonomy IDOL search engine.

Next was Indy Tharmakumar from our hosts Goldman Sachs, showing how his team have built powerful support systems using ElasticSearch to index log data. Using 32 1 core CPU instances the system they have built can store 1.2 billion log lines with a throughput up to 40,000 messages a second (the systems monitored produce 5TB of log data every day). Log data is queued up in Redis, distributed to many Logstash processes, indexed by Elasticsearch with a Kibana front end. They learned that Logstash can be particularly CPU intensive but Elasticsearch itself scales extremely well. Future plans include considering Apache Kafka as a data backbone.

The third presentation was by Clinton Gormley of ElasticSearch, talking about the new cross field matching features that allow term frequencies to be summed across several fields, preventing certain cases where traditional matching techniques based on Lucene’s TF/IDF ranking model can produce some unexpected behaviour. Most interesting for me was seeing Marvel, a new product from ElasticSearch (the company), containing the Sense developer console allowing for on-the-fly experimentation. I believe this started as a Chrome plugin.

The last talk, by Mark Harwood, again from ElasticSearch, was the most interesting for me. Mark demonstrated how to use a new feature (planned for the 1.1 release, or possibly later), an Aggregator for significant terms. This allows one to spot anomalies in a data set – ‘uncommon common’ occurrences as Mark described it. His prototype showed a way to visualise UK crime data using Google Earth, identifying areas of the country where certain crimes are most reported – examples including bike theft here in Cambridge (which we’re sadly aware of!). Mark’s Twitter account has some further information and pictures. This kind of technique allows for very powerful analytics capabilities to be built using Elasticsearch to spot anomalies such as compromised credit cards and to use visualisation to further identify the guilty party, for example a hacked online merchant. As Mark said, it’s important to remember that the underlying Lucene search library counts everything – and we can use those counts in some very interesting ways.
UPDATE Mark has posted some code from his demo here.

The evening closed with networking, pizza and beer with a great view over the City – thanks to Yann Cluchey for organising the event. We have our own Cambridge Search Meetup next week and we’re also featuring ElasticSearch, as does the London Search Meetup a few weeks later – hope to see you there!

Search events for 2014

Here’s a few search-related events over the next few months for your consideration:

  • On Tuesday 1st April the International Society for Knowledge Organisation are holding a seminar in London on ‘Taming the News Beast‘ with contributions from the BBC and Press Association amongst others. We’ll be attending as many of our clients are from the news sector.
  • On 29th-30th April (with workshops the day before) we have Enterprise Search Europe, now at a new (slightly more central) London venue and with presentations from Ernst & Young, Reed Elsevier, MAN Truck & Bus, AstraZeneca and the University of London – do take a look at the really strong programme this year. On the Monday I’ll be repeating my workshop on Getting the Best from Open Source Search for those interested in planning and/or implementing an open source search application. I’m very pleased to be able to offer a 20% discount on registration fees – just use the code HULL20 when you apply.
  • Berlin Buzzwords is held on May 25th-28th with the usual mix of talks on Search, Store and Scale – this is always a popular event and we expect someone from Flax will attend.
  • I’ll post up more events as they are announced – we’re also hoping to hold another Cambridge Search Meetup soon. Do let me know if you’d like to meet up at any of the events above!

    Tags: ,

    Posted in events

    January 21st, 2014

    No Comments »

Search Solutions 2013, a review

Yesterday was the always interesting Search Solutions one day conference held by the BCS IRSG in London, a mix of talks on different aspects of search. The first presentation was by Behshad Behzadi of Google on Conversational Search, where he showed a speech-capable search interface that allowed a ‘conversation’ with the search engine – context being preserved – so the query “where are Italian restaurants in Chelsea” followed by “no I prefer Chinese” would correctly return results about Chinese restaurants. The demo was impressive and we can expect to see more of this kind of technology as smartphone adoption rises. Wim Nijmeijer of Coveo followed with details of how their own custom connectors to a multitude of repositories could enable Complex enterprise search delivered in a day. This of course assumes that no complex mapping of fields or schemas from the source to the search engine index is necessary, which I suspect it often is – I’m not alone in being slightly suspicious of the supposed timescale. Nikolaos Nanas from Thessaly in Greece then presented on Adaptive Information Filtering: from theory to practise which I found particularly interesting as it described filtering documents against a user’s interest with the latter modelled by an adaptive, weighted network – he showed the Noowit personalised magazine application as an example. With over 1000 features per user and no language specific requirements this is a powerful idea.

After a short break we continued with a talk by Henning Rode on CV Search at TextKernel. He described a simple yet powerful UI for searching CVs (resumes) with autosuggest and automatic field recognition (type in “Jav” and the system suggests “Java” and knows this is a programming language or skill). He is also working on systems to autogenerate queries from job vacancies using heuristics. We’ve worked in the recruitment space ourselves so it was interesting to hear about their approach, although the technical detail was light. Following Henning was Dermot Frost talking about Information Preservation and Access at the Digital Repository of Ireland and their use of open source technology including Solr and Blacklight to build a search engine with a huge variety of content types, file formats and metadata standards across the items they are trying to digitally preserve. Currently this is a relatively small collection of data but they are planning to scale up over the next few years: this talk reminded me a little of last year’s by Emma Bayne of the UK’s National Archive.

After lunch we began a session named Understanding the User, beginning with Filip Radlinski of Microsoft Research. He discussed Sensitive Online Search Evaluation (with arXiv.org as a test collection) and how interleaved results is a powerful technique for avoiding bias. Next was Mounia Lalmas of Yahoo! Labs on what makes An Engaging Click (although unfortunately I had to pop out for a short while so I missed most of what I am sure was a fascinating talk!). Mags Hanley was next on Understanding users search intent with examples drawn from her work at TimeOut – the three main lessons being to know the content in context, the time of year and the users’ mental model in context. Interestingly she showed how the most popular facets used differed across TimeOut’s various international sites – in Paris the top facet was perhaps unsurprisingly ‘cuisine’, in London it was ‘date’.

After another short break we continued with Helen Lippell’s talk on Enterprise Search – how to triage problems quickly and prescribe the right medicine – her five main points being analyze user needs, fix broken content, focus on quick wins in the search UI, make sure you are able to tweak the search engine itself in a documentable fashion and remember the importance of people and process. Her last point ‘if search is a political football, get an outsider perspective’ is of course something we would agree with! Next was Peter Wallqvist of Ravn Systems on Universal Search and Social Networking where he focussed on how to allow users to interact directly with enterprise content items by tagging, sharing and commenting – so as to derive a ‘knowledge graph’ showing how people are connected by their relationships to content. We’ve built systems in the past that have allowed users to tag items in the search result screen itself so we can agree on the value of this approach. Our last presenter with Kristian Norling of Findwise on Reflections on the 2013 Enterprise Search Survey – some more positive news this year, with budgets for search increasing and 79% of respondents indicating that finding information is of high importance for their organisation. Although most respondents still have less than one full time staff member working on search, Kristian made the very good point that recruiting just one extra person would thus give them a competitive advantage. Perhaps as he says we’ve now reached a tipping point for the adoption of properly funded enterprise search regarded as an ongoing journey rather than a ‘fire and forget’ project.

The day finished with a ‘fishbowl’ session, during which there was a lot of discussion of how to foster links between the academic IR community and industry, then the BCS IRSG AGM and finally a drinks reception – thanks to all the organisers for a very interesting and enlightening day and we look forward to next year!

Lucene Revolution 2013, Dublin: day 1

Four of the Flax team are in Dublin this week for Lucene Revolution, almost certainly the largest event centred on open source search and specifically Lucene. There are probably a couple of hundred Lucene enthusiasts here and the event is being held at the Aviva Stadium on Landsdowne Road: look out the windows and you can see the pitch! Here are some personal reflections: a number of the talks I attended today have a connection to our own work in media monitoring which we’re talking about tomorrow.

Doug Turnbull’s Test Driven Relevancy was interesting, discussing OSC’s Quepid tool that allows content owners and search experts to work together to tweak and tune Solr’s options to present the right results for a query. I wondered whether this tool might eventually be used to develop a Learning to Rank option for Solr, as Lucene 4 now supports a pluggable scoring model.

I enjoyed Real-Time Inverted Search in the Cloud Using Lucene and Storm during which Joshua Conlin told us about running hundreds of thousands of stored queries in a distrubuted architecture. Storm in particular sounds worth investigating further. There is currently no attempt to reduce or ‘prune’ the set of queries before applying them: Joshua quoted speeds of 4000 queries/sec across their cluster of 8 instances: impressive numbers, but our own monitoring applications are working at 20 times that speed by working out which queries not to apply.

I broke out at this point to catch up with some contacts, including the redoubtable Iain Fletcher of Search Technologies – always a pleasure. After a sandwich lunch I went along to hear Andrzej Bialecki of Lucidworks talk about Sidecar Indexes, a method for allowing rapid updates to Lucene fields. This reminded me of our own experiments in this area using Lucene’s pluggable codecs.

Next was more from the Opensource Connections team, as John Berryman talked about their work to update a patent search application that uses a very old search syntax, BRS. This sounds very much the work we’ve done to translate one search engine syntax into another for various media monitoring companies – so far we can handle dtSearch and we’re currently finishing off support for HP/Autonomy Verity’s VQL (PDF).

This latter issue has got me thinking that perhaps it might be possible to collaboratively develop an open source search engine query language – various parsers could be developed to turn other search syntaxes into this language, and search engines like Lucene (or anything else) could then be extended to implement support for it. This would potentially allow much easier migration between search engine technologies. I’m discussing the concept with various folks at the event this week so do please get in touch if you are interested!

Back tomorrow with a further update on this exciting conference – tonight we’re all off to the Temple Bar area of Dublin for food and drink, generously provided by Lucidworks who should also be thanked for organising the Revolution.

Tags: , , , , , ,

Posted in Technical, events

November 6th, 2013

3 Comments »

Solr and the changing landscape of search

This morning I was told about the launch of a new US-based search company, Heliosearch, founded by the creator of Apache Solr, Yonik Seeley. It seems the landscape of open source search and in particular Solr is changing again – Heliosearch are planning their own ‘certified’ distribution of Solr plus a raft of support, consulting and services. In the meantime, the company Yonik co-founded (and our partners) LucidWorks are recently launched an ‘App Store’ for search, the Solr Marketplace, offering add-ons to the core engine from both themselves and others.

What we’re seeing here is the further growth of an ecosystem based around what has almost become the default choice for new and migrating search applications. Some clients will want a packaged distribution of Solr, some will be happy to download the source from Apache, some will need help getting started and some will just need help when things get complicated, or support for a running application. We’ve seen all of these requirements and more in the last year.

Next week the largest conference on open source search, Lucene Revolution is held in Dublin, and four of the Flax team are attending. Do let us know if you’d like to meet up – I don’t think there’s going to be a lack of things to talk about!

Tags: , , , ,

Posted in News, events

October 29th, 2013

No Comments »

Elasticsearch meetup – Duedil, Hadoop and more

I visited the London Elasticsearch User Groupsmeetup last night for the first time, in the rather splendid HQ of Skills Matter just down from Old Street – the venue had a great buzz. The first speaker was Chris Simpson from Duedil who provide UK company information gleaned from Companies House and other sources. He told us about using Elasticsearch to provide faceted search (including some great clickable bar graphs for numerical range facets) and how they bulk index around 9 million company records in about an hour, using Elasticsearch’s alias features to swap in new indexes once they’re ready – so there is no impact on search performance while indexing. He mentioned a common problem with search engines, which is there is no easy way to be sure how much hardware you’ll need until you ‘know your data and know your hosts’.

Next up was Chris Harris from Hortonworks, who provide a packaged and supported Apache Hadoop distribution. He explained how Hadoop can be used for capturing huge numbers of transactions (these could be interactions with an e-commerce website for example) and for storing them in a distributed database on low-cost hardware. The Hive ‘SQL-like’ language can then be used to extract the data and send it directly to Elasticsearch, or indeed to run queries on Elasticsearch and send the results back to Hadoop as a table. Similar processes can be run with the Pig scripting language. There followed some interesting discussions about the future of Hadoop, where search engines such as Elasticsearch may run directly on Hadoop nodes, working with the data locally. It will be interesting to compare this with the approach taken by Cloudera who are talking on Hadoop & Solr this Thursday at our own Meetup in Cambridge.

Clinton Gormley from Elasticsearch finished up with a Q&A, during which he talked about the new Phrase Suggesters based on Lucene’s new Finite State Machines, and gave hints about when the long awaited 1.0 release of Elasticsearch will appear – apparently early 2014 is now likely.

Thanks to all the speakers and to Elasticsearch for the very welcome beer and pizza – this certainly won’t be our last visit to this user group on what is an increasingly adopted open source search engine.