meetup – Flax

London Lucene/Solr Meetup – Relevance tuning for Elsevier’s Datasearch & harvesting data from PDFs

Charlie Hull — Thu, 03 May 2018 09:47:48 +0000

Elsevier were our kind hosts for the latest London Lucene/Solr Meetup and also provided the first speaker, Peter Cotroneo. Peter spoke about their DataSearch project, a search engine for scientific data. After describing how most other data search engines only index and rank results using metadata, Peter showed how Elsevier’s product indexes the data itself and also provides detailed previews. DataSearch uses Apache NiFi to connect to the source repositories, Amazon S3 for asset storage, Apache Spark to pre-process the data and Apache Solr for search. This is a huge project with many millions of items indexed.

Relevance is a major concern for this kind of system and Elsevier have developed many strategies for relevance tuning. These include features such as highlighting and auto-suggest, lemmatisation rather than stemming (with scientific data, stemming can cause issues such as turning ‘Age’ into ‘Ag’ – the chemical symbol for silver) and a custom rescoring algorithm that can promote up to 3 data results to the top of the list if they are deemed particularly relevant. Elsevier use both search logs and test queries generated by subject matter experts to feed into a custom-built judgement tool, which they are hoping to open source at some point (this would be a great complement to Quepid for test-based relevance tuning).
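To make the ‘Age’ versus ‘Ag’ problem concrete, here is a minimal sketch. The suffix list and the lemma dictionary are invented for illustration; this is a toy suffix stripper, not the Porter algorithm or anything Elsevier described, but it shows the same failure mode.

```python
# Toy suffix-stripping stemmer: NOT a real stemming algorithm,
# just enough to show how an unrelated word and a symbol can collide.
SUFFIXES = ("ing", "es", "s", "e")


def toy_stem(word):
    w = word.lower()
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least two characters.
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            return w[: -len(suffix)]
    return w


# Hypothetical lemma dictionary: inflected form -> dictionary form.
LEMMAS = {"ages": "age", "aged": "age", "ageing": "age"}


def toy_lemmatise(word):
    w = word.lower()
    # Unknown words are left alone rather than chopped.
    return LEMMAS.get(w, w)


for term in ("Age", "ages", "Ag"):
    print(term, "->", toy_stem(term), "|", toy_lemmatise(term))
```

With the stemmer, a search for ‘Age’ and a search for ‘Ag’ end up on the same index term; with the dictionary lookup they stay distinct.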

Peter also described a strategy for automatic optimization of the many query parameters available in Solr, using machine learning, based on some ideas first proposed by Simon Hughes of dice.com. Elsevier have also developed a Phrase Service API, which helps improve phrase-based search over the standard unordered ‘bag of words’ model by recognising acronyms, chemical formulae, species, geolocations and more, expanding the original phrase based on these terms and then boosting them using Solr’s query parameters. He also mentioned a ‘push API’ that data providers can use to push data directly into DataSearch. This was a necessarily brief dive into what is obviously a highly complex and powerful search engine built by Elsevier using many cutting-edge ideas.
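The expand-then-boost idea might look something like the sketch below. The `KNOWN_PHRASES` table stands in for the Phrase Service (whose real interface was not shown), and the field name `text` and the boost value are assumptions. The `defType`, `q`, `qf` and `bq` parameters are standard Solr eDisMax parameters.

```python
from urllib.parse import urlencode

# Hypothetical lookup standing in for the Phrase Service:
# recognised phrase -> expansions. The entries are invented examples.
KNOWN_PHRASES = {
    "nacl": ["sodium chloride"],
    "e. coli": ["escherichia coli"],
}


def build_params(user_query, field="text", boost=5):
    """Build eDisMax parameters that boost recognised phrases and their expansions."""
    lowered = user_query.lower()
    boosts = []
    for phrase, expansions in KNOWN_PHRASES.items():
        if phrase in lowered:
            # Boost the phrase itself and each expansion as a quoted phrase query.
            for term in [phrase] + expansions:
                boosts.append(f'{field}:"{term}"^{boost}')
    return {"defType": "edismax", "q": user_query, "qf": field, "bq": boosts}


params = build_params("solubility of NaCl in water")
# doseq=True emits one bq parameter per boost query.
query_string = urlencode(params, doseq=True)
print(query_string)
```

Keeping the user's original query in `q` and putting the expansions in `bq` means the recognised phrases only influence ranking, not which documents match.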

Our next speaker, Michael Hardwick of Elite Software, talked about how textual data is stored in PDF files and the implications for extracting this data for search applications. In an engaging (and at times slightly horrifying) talk he showed how PDFs effectively contain instructions for ‘painting’ characters onto the page and how certain essential text items such as spaces may not be stored at all. He demonstrated how fonts are stored within the PDF itself, how character encodings may be deliberately incorrect to prevent copy-and-paste operations and in general how very little if any semantic information is available. Using newspaper content as an example he showed how reading order is often difficult to extract as the PDF layout is a combination of the text from the original author and how it has been laid out on the page by an editor – so the headline may have been added after the article text, which itself may have been split up into sections.
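A minimal sketch of what this means for an extractor: if the file only records where each glyph was painted, spaces have to be inferred from the gaps between glyphs. The `Glyph` type and the `gap_ratio` threshold below are assumptions for illustration, not Michael’s code, and real PDFs add many complications (rotated text, kerning, multiple columns).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the painted glyph
    width: float  # advance width of the glyph


def rebuild_line(glyphs, gap_ratio=0.3):
    """Join glyphs on one baseline, inserting a space wherever the horizontal gap is wide."""
    # Paint order is not reading order, so sort by horizontal position first.
    ordered = sorted(glyphs, key=lambda g: g.x)
    out = []
    prev = None
    for g in ordered:
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            # A gap wider than a fraction of the previous glyph is treated as a space.
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# Glyphs deliberately listed out of reading order, with no space character stored.
painted = [Glyph("b", 14, 5), Glyph("t", 0, 5), Glyph("e", 19, 5), Glyph("o", 5, 5)]
print(rebuild_line(painted))
```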

Tables in PDFs were described as a particular issue when attempting to extract numerical data for re-use – the data may not be stored in the order in which it appears on the page, for example if only part of a table is updated each week when a regular publication appears. With PDF files sometimes compressed and encrypted, the task of data extraction can become even more difficult. Michael laid out the choices available to those wanting to extract data: optical character recognition, a potentially very expensive Adobe API (that only gives the same quality of output as copy-and-paste), custom code as developed by his company and finally manual retyping, the latter being surprisingly common.

Thanks to both our speakers and our hosts Elsevier – we’re planning another Meetup soon, hopefully in mid to late June.

The post London Lucene/Solr Meetup – Relevance tuning for Elsevier’s Datasearch & harvesting data from PDFs appeared first on Flax.

London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco

Charlie Hull — Thu, 08 Feb 2018 14:55:22 +0000

This time Pivotal were our kind hosts for the London Lucene/Solr Meetup, providing a range of goodies including some frankly enormous pizzas – thanks Costas and colleagues, we couldn’t have done it without you!

Our first talk was from Uwe Schindler, Lucene committer, who started with some history of how previous Java 7 releases had broken Apache Lucene in somewhat spectacular fashion. After this incident the Oracle JDK team and Lucene PMC worked closely together to improve both communications and testing – with regular builds of Java 8 (using Jenkins) being released to test with Lucene. The Oracle team later publicly thanked the Lucene committers for their help in finding Java issues. Uwe told us how Java 9 introduced a module system named ‘Jigsaw’ which tidied up various inconsistencies in how Java keeps certain APIs private (but not actually private) – this caused some problems with Solr. Uwe also mentioned how Lucene’s MMapDirectory should be used on 64-bit platforms (there’s a lot more detail on his blog) and various intrinsic bounds-checking features which can be used to simplify Lucene code. The three main advantages of Java 9 that he mentioned were lower garbage collection times (with the new G1GC collector), more security features and in some cases better query performance. Going forward, Uwe is already looking at Java 10 and future versions and how they impact Lucene – but for now he’s been kind enough to share his slides from the Meetup.

Our second speaker was Andy Hind, head of search at Alfresco. His presentation included the obvious Austin Powers references of course! He described the architecture Alfresco use for search (a recent blog also shows this – interestingly although Solr is used, Zookeeper is not – Alfresco uses its own method to handle many Solr servers in a cluster). The test system described ran on the Amazon EC2 cloud with 10 Alfresco nodes and 20 Solr nodes and indexed around 1.168 billion items. The source data was synthetically generated to simulate real-world conditions with a certain amount of structure – this allowed queries to be built to hit particular areas of the data. 5000 users were set up with around 500 concurrent users assumed. The test system managed to index the content in around 5 days at a speed of around 1000 documents a second, which is impressive.

Thanks to both our speakers and we’ll return soon – if you have a talk for our group (or can host a Meetup) do please get in touch.


London Lucene/Solr Meetup – Introducing Marple & Solr Classification

Charlie Hull — Mon, 27 Mar 2017 13:16:36 +0000

A small crowd for this month’s London Lucene/Solr Meetup, kindly hosted by Barclays in their sumptuous Canary Wharf offices. I introduced the Meetup and spoke briefly on how Flax is currently looking for team members (want to work on a variety of cutting-edge open source search projects in the UK and abroad? Get in touch!) before handing over to Flax’s Alan Woodward, who introduced our new Lucene index inspection tool, Marple.

Alan told us how Marple was conceived at the Lucene4IR event in Glasgow last year and how coding started at our Lucene Hackday in London. Although the well-known tool Luke allows one to dive deep into Lucene indexes, it hasn’t kept up with recent additions to Lucene index structures. We also wanted to build a tool with a RESTful API and separate GUI so that it can be run easily on our clients’ indexes in a read-only mode. Alan demonstrated Marple’s features including how it allows one to see the ‘hidden’ Lucene index fields that Elasticsearch creates. The first release of Marple is out and we’d welcome any feedback and contributions.

Next up was Alessandro Benedetti with an engaging talk about Solr’s built-in document classification features, useful for everything from spam filtering to automatic product categorisation. Unlike many classification methods, this uses the Lucene index itself as the training set – this index must contain some documents with manually assigned classification fields. Either K-Nearest-Neighbour or Naive Bayes algorithms can be used to perform the classification via Solr’s UpdateRequestProcessor chain, in Solr versions after 6.1. You can read more detail on Alessandro’s excellent blog.
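The ‘index as training set’ idea can be sketched in a few lines. This is a toy K-Nearest-Neighbour classifier over an in-memory list, not Solr’s or Lucene’s implementation: the `category` field name, the Jaccard similarity measure and the sample documents are all assumptions made for illustration.

```python
from collections import Counter


def tokens(text):
    return set(text.lower().split())


def knn_classify(index, text, k=3):
    """Assign the majority class among the k most similar labelled documents."""
    query = tokens(text)
    scored = []
    for doc in index:
        if "category" not in doc:
            continue  # only manually classified documents act as training data
        doc_tokens = tokens(doc["text"])
        union = query | doc_tokens
        # Jaccard overlap used here as a simple stand-in similarity measure.
        similarity = len(query & doc_tokens) / len(union) if union else 0.0
        scored.append((similarity, doc["category"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(category for _, category in scored[:k])
    return votes.most_common(1)[0][0] if votes else None


# Invented sample "index": some documents carry a manual label, one does not.
index = [
    {"text": "cheap pills buy now", "category": "spam"},
    {"text": "buy cheap watches now", "category": "spam"},
    {"text": "meeting agenda for monday", "category": "ham"},
    {"text": "project meeting notes", "category": "ham"},
    {"text": "unlabelled document about pills"},
]

print(knn_classify(index, "buy cheap pills"))
print(knn_classify(index, "monday project meeting"))
```

The point of the design is that no separate training step or model file is needed: labelling a few more documents in the index immediately changes how new documents are classified.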

We concluded with a brief Q&A session and then popped downstairs to a pub for some snacks and drinks. Thanks to both our speakers, our hosts and all who came – we’ll return in a couple of months with talks that will include René Kriegler on his neat Querqy query processor.


Just the facts with Solr & Luwak

Charlie Hull — Wed, 04 Jan 2017 15:58:19 +0000

It won’t have escaped your notice that factchecking is very much in the news recently due to last year’s political upheavals in both the US and UK and the suspected influence of fake news on voters. Both traditional and social media organisations are making efforts in this area; examples include Channel 4 and Facebook.

At our recent London Lucene/Solr Meetup, UK charity Full Fact spoke eloquently on the need for automated factchecking tools to help identify and correct stories that are demonstrably false. They’ve also published a great report on The State of Automated Factchecking which mentions both Apache Solr and our powerful stored query library Luwak as components of their platform. We’ve been helping Full Fact with their prototype factchecking tools for a while now, and during the Meetup I suggested we might run a hackday to develop these further.

Thus I’m very pleased to announce that Facebook have offered us a venue in London for the hackday on January 20th (register here). Many Solr developers, including several committers and PMC members, are signed up to attend already. We’ll use Full Fact’s report and their experiences of factchecking newspapers, TV’s Question Time and Hansard to design and build practical, useful tools and identify a future roadmap. We’ll aim to publish what we build as open source software which should also benefit factchecking organisations across the world.

If you’re concerned about the impact of fake news on the political process and want to help, join the Meetup and/or donate to Full Fact.


A tale of two cities (and two Lucene Hackdays)

Charlie Hull — Fri, 21 Oct 2016 10:27:00 +0000

To mark Flax’s 15th anniversary we ran two Lucene Hackdays recently, in London and Boston. I even made some Flax cakes! The London event was attended by around 20 people from companies both large and small and kindly hosted by Bloomberg (who are currently very active in the Lucene/Solr community). We split up into a number of groups to work on a range of projects. Erica Sundberg from Blackrock took a group of beginners through installing Solr and indexing their first collection, while also considering how a minimal Solr example could be built (some of the shipped examples being rather complex). Another team led by Christine Poerschke of Bloomberg looked at a way to avoid slightly different statistics being returned from different Solr replicas (which can cause result ordering to appear to ‘jump’) and Diego Ceccarelli looked at adding BM25F ranking to Lucene. Other groups looked at SQL streaming with Solr (committer Joel Bernstein dialed in via Skype to help) and Flax’s Alan Woodward worked on Marple, a browser-based explorer for Lucene indexes. The day finished with a curry dinner kindly sponsored by Alfresco.

Several days later we ran a similar Hackday in Boston, as many Lucene people were in town for Lucene Revolution. Many more Lucene/Solr committers attended this time and enjoyed a chance to work on their own projects or to continue some of the work we’d started in London. Doug Turnbull came up with a way to do BM25F ranking with existing Lucene features while Alexandre Rafalovitch and I had a long conversation about minimal Solr examples and improving the way beginners can start with Solr. Other projects included new field types for Lucene, improved highlighters and DocValues. BA Insight were kind enough to provide the venue and Lucidworks sponsored drinks and snacks later in the pub downstairs.

We’ve gathered notes on what we worked on with links to some of the software we developed here – please do get involved if you can! In particular the Marple project is attracting further contributions (and interest from those who developed and maintain the existing Luke Lucene index inspector).

I’d like to thank everyone who came to the Hackdays, our generous sponsors for providing venues, food and drink, and those who helped organise the events. The feedback has been excellent (and do let us know if you have any further comments) and people seem keen for this to be a regular event before the annual Lucene Revolution conference – a chance to work on Lucene-based projects outside of regular work, to meet, network and spend time with other contributors and to enjoy being part of a great open source community. We’ll be back!


Search and other events for Autumn 2012

Charlie Hull — Tue, 18 Sep 2012 10:44:05 +0000

The diary is beginning to fill up – here are a few events we’ll be involved with over the next few months. Firstly we’re running another Cambridge Search Meetup on October 17th. This is an informal gathering of people interested in search: we have one great talk already on ‘Making search accessible to low cost apps’ and another to be confirmed, plus snacks, beer and even some live music afterwards. If you’re in Cambridge or nearby (it’s only an hour or so from London by train) do come along.

We’ll be briefly visiting the trade stands at FIBEP 2012 on October 4th in the historic town of Krakow, Poland – this is part of a major media monitoring event, the 45th FIBEP Congress. We’re looking forward to meeting companies in the media monitoring sector and talking about some of our projects in that area.

On November 29th we’re planning to attend Search Solutions 2012 at the BCS in Covent Garden, London – this is an excellent one-day event on all the technical aspects of search. You can read my review of last year’s event to find out more about what to expect.

There’s sure to be more to come!


Enterprise Search Europe & a SuperSized Search Meetup

Charlie Hull — Fri, 22 Jul 2011 09:57:46 +0000

We’ve been helping to organise a new conference to be held in London this October, Enterprise Search Europe. This two-day event promises to give a ‘European perspective on the technology, selection, implementation and optimisation of enterprise-scale search’ and features speakers from 3i plc, Logica, The Guardian and a number of search providers such as Findwise, Funnelback and ourselves (I’ll be talking on ‘Building a Strong Business Foundation with Open Source Search’ on the second day).

It’s going to be a busy time as I’m also chairing a panel on the first day and helping run the evening reception, which is co-hosted by the London and Cambridge Search Meetups – this is likely to be one of the largest Search Meetups ever and is sure to be a fascinating evening, featuring speakers from the conference in an informal setting (i.e., a pub!).

Hope to see some of you there.
