London Lucene/Solr Meetup – Relevance tuning for Elsevier’s Datasearch & harvesting data from PDFs
Thu, 03 May 2018 – http://www.flax.co.uk/blog/2018/05/03/london-lucene-solr-meetup-elseviers-datasearch-harvesting-data-from-pdfs/

Elsevier were our kind hosts for the latest London Lucene/Solr Meetup and also provided the first speaker, Peter Cotroneo. Peter spoke about their DataSearch project, a search engine for scientific data. After describing how most other data search engines only index and rank results using metadata, Peter showed how Elsevier’s product indexes the data itself and also provides detailed previews. DataSearch uses Apache NiFi to connect to the source repositories, Amazon S3 for asset storage, Apache Spark to pre-process the data and Apache Solr for search. This is a huge project with many millions of items indexed.

Relevance is a major concern for this kind of system and Elsevier have developed many strategies for relevance tuning. Features such as highlighting and auto-suggest are used, along with lemmatisation rather than stemming (with scientific data, stemming can cause issues such as turning ‘Age’ into ‘Ag’, the chemical symbol for silver) and a custom rescoring algorithm that can promote up to three data results to the top of the list if they are deemed particularly relevant. Elsevier use both search logs and test queries generated by subject matter experts to feed a custom-built judgement tool, which they hope to open source at some point (this would be a great complement to Quepid for test-based relevance tuning).
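The ‘Age’ to ‘Ag’ hazard is easy to reproduce: suffix-stripping stemmers drop a trailing ‘e’, which is harmless for everyday English but disastrous for chemical symbols. A minimal sketch of the difference (the `naive_stem` function and `LEMMAS` table are illustrative inventions, not Elsevier’s code):

```python
# Illustrative only: a naive suffix-stripping "stemmer" versus a
# dictionary-based lemmatiser, showing why stemming is risky for
# scientific text.

def naive_stem(word: str) -> str:
    """Crude Porter-style stemming: strip plural 's'/'es', then a trailing 'e'."""
    w = word.lower()
    if w.endswith("es"):
        w = w[:-2]
    elif w.endswith("s"):
        w = w[:-1]
    if w.endswith("e"):
        w = w[:-1]
    return w

# A lemmatiser instead maps inflected forms to a dictionary headword,
# so surface forms that aren't in the dictionary pass through unchanged.
LEMMAS = {"age": "age", "ages": "age", "aging": "age", "studies": "study"}

def lemmatise(word: str) -> str:
    return LEMMAS.get(word.lower(), word.lower())
```

Under the stemmer, a search for ‘age’ would also match ‘Ag’ (silver); the lemmatiser keeps the two distinct.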

Peter also described a strategy for the automatic optimisation of the many query parameters available in Solr, using machine learning, based on ideas first proposed by Simon Hughes of dice.com. Elsevier have also developed a Phrase Service API, which improves phrase-based search over the standard unordered ‘bag of words’ model by recognising acronyms, chemical formulae, species, geolocations and more, expanding the original phrase based on these terms and then boosting them using Solr’s query parameters. He also mentioned a ‘push API’ that data providers can use to push data directly into DataSearch. This was a necessarily brief dive into what is obviously a highly complex and powerful search engine built by Elsevier using many cutting-edge ideas.
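The parameter-optimisation idea can be sketched as a search over field boost weights, scored against judged test queries. Everything below is hypothetical: in practice each candidate would be sent to Solr as something like qf="title^{t} abstract^{a} body^{b}" and scored with a metric such as NDCG over the judgement lists, where here a stub `evaluate` function with a known optimum stands in:

```python
import itertools

def evaluate(title_boost, abstract_boost, body_boost):
    """Stand-in for a relevance metric (e.g. mean NDCG over judged queries).
    Fakes a smooth score surface with a known optimum at (3, 2, 1)."""
    return -((title_boost - 3) ** 2
             + (abstract_boost - 2) ** 2
             + (body_boost - 1) ** 2)

def grid_search(values):
    """Exhaustively try every combination of boost weights and keep the best.
    Real systems use smarter search (random, Bayesian, learned) over far
    more parameters, but the loop is the same shape."""
    best_params, best_score = None, float("-inf")
    for params in itertools.product(values, repeat=3):
        score = evaluate(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params

best = grid_search([1, 2, 3, 4])
```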

Our next speaker, Michael Hardwick of Elite Software, talked about how textual data is stored in PDF files and the implications for extracting it for search applications. In an engaging (and at times slightly horrifying) talk he showed how PDFs effectively contain instructions for ‘painting’ characters onto the page, and how certain essential text items such as spaces may not be stored at all. He demonstrated how fonts are stored within the PDF itself, how character encodings may be deliberately incorrect to prevent copy-and-paste operations, and in general how very little semantic information, if any, is available. Using newspaper content as an example he showed how reading order is often difficult to extract, as the PDF layout combines the text from the original author with how an editor has laid it out on the page – the headline may have been added after the article text, which itself may have been split up into sections.
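A toy illustration of the ‘painting’ model: PDF content streams position text with operators like Td and paint strings with Tj, and word boundaries may exist only as coordinate changes, never as space characters. The fragment and parser below are heavily simplified for illustration (real streams are usually compressed, and the coordinates here are invented):

```python
import re

# A simplified, uncompressed PDF content-stream fragment. Each Tj paints a
# string at the current text position; no space characters are stored at
# all -- the gaps between words exist only as x-coordinate changes.
CONTENT_STREAM = """
BT
/F1 12 Tf
72 700 Td (Breaking) Tj
130 700 Td (news) Tj
72 684 Td (today) Tj
ET
"""

def extract_text(stream: str) -> str:
    """Naively concatenate every (...) Tj string, roughly what a crude
    copy-and-paste implementation does. Word boundaries are lost because
    they were never stored, only implied by positioning."""
    return "".join(re.findall(r"\((.*?)\)\s*Tj", stream))
```

Running `extract_text` on the fragment yields the words run together, which is exactly the kind of output Michael showed that extraction tools have to repair using layout heuristics.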

Tables in PDFs were described as a particular issue when attempting to extract numerical data for re-use – the data may not be stored in the order in which it appears on the page, for example when only part of a table is updated each week in a regular publication. With PDF files sometimes compressed and encrypted, the task of data extraction can become even more difficult. Michael laid out the choices available to those wanting to extract data: optical character recognition, a potentially very expensive Adobe API (which only gives the same quality of output as copy-and-paste), custom code such as that developed by his company, and finally manual retyping – the latter being surprisingly common.

Thanks to both our speakers and our hosts Elsevier – we’re planning another Meetup soon, hopefully in mid to late June.

Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara
Thu, 18 Feb 2016 – http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/

Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations, and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing.
Elasticsearch provides querying capabilities, with visualisations in Kibana. Joseph’s team have focused on making the system (and the tasks that run on it) easy to deploy and use; deployment is currently managed using Ansible and TeamCity, although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.
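The windowed aggregations mentioned above are the bread and butter of such a pipeline. As a rough sketch, in plain Python standing in for a Samza task consuming a Kafka topic (the event shape and values are invented):

```python
from collections import defaultdict

# Hypothetical events as (timestamp_seconds, user_id) pairs, as might be
# consumed one at a time from a Kafka topic by a stream-processing task.
EVENTS = [(1, "alice"), (2, "bob"), (61, "alice"), (62, "alice"), (125, "bob")]

def tumbling_window_counts(events, window_seconds=60):
    """Count events per user per tumbling (non-overlapping) window -- the
    simplest windowed aggregation. Each event falls into exactly one
    window, keyed by the window's start time."""
    counts = defaultdict(int)
    for ts, user in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, user)] += 1
    return dict(counts)

result = tumbling_window_counts(EVENTS)
```

A real Samza job does the same thing incrementally, with the counts held in fault-tolerant state rather than recomputed in a batch.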

Next was Bruno Bonacci, showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With such a short timescale, Bruno took a pragmatic approach, combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can be done either at ingestion time (which can consume huge amounts of storage, as you must store every aggregate you might need in separate chunks) or at query time (which is far more flexible, especially when you don’t yet know what questions you’ll need to answer). His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather an in-memory key-value store. For resiliency, this produces a log of changes which is emitted as a Kafka stream. The approach brings huge performance benefits – he has demonstrated Samsara running on a single core performing 72 times faster than a 4-core Spark Streaming system. Bruno and his team have released Samsara as open source and are working on new processing modules, including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see whether our own work combining our stored query library Luwak with Samza could also be applied to Samsara.
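The pattern Bruno described, fast in-memory state backed by a changelog for recovery, can be sketched like this (the class is illustrative; in Samsara the log would be emitted to a Kafka topic rather than appended to a Python list):

```python
class ChangelogStore:
    """In-memory key-value state with a change log for resiliency.
    Reads and writes hit only memory; every write is also appended to a
    log so that state can be rebuilt by replaying it after a crash."""

    def __init__(self):
        self.state = {}
        self.log = []

    def put(self, key, value):
        self.state[key] = value
        self.log.append((key, value))  # emit to the changelog

    def get(self, key, default=None):
        return self.state.get(key, default)

    @classmethod
    def recover(cls, log):
        """Rebuild state by replaying the changelog in order; later
        entries for a key overwrite earlier ones."""
        store = cls()
        for key, value in log:
            store.state[key] = value
        return store
```

Because no write ever waits on a remote database, the hot path is a dictionary update plus an append, which is where much of the single-core speed comes from.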

Thanks to Alex Dean of Snowplow for organising a very interesting evening and of course, to both the speakers.

Elasticsearch Meetup – Spark, postcodes and Couchbase
Fri, 25 Sep 2015 – http://www.flax.co.uk/blog/2015/09/25/elasticsearch-meetup-spark-postcodes-and-couchbase/

Three speakers for this month’s Elasticsearch Meetup (slides now up), kindly hosted by JustEat’s technical department. Neil Andrassy kicked us off with a talk about how TheFilter (which, you may know, counts Peter Gabriel as an investor) use Apache Spark to load data into their Elasticsearch cluster. Neil described how Spark and Elasticsearch have superseded both Microsoft SQL Server and MongoDB – Spark in particular being described as ‘speedy, flexible and componentized’, with Spark’s RDDs (Resilient Distributed Datasets) mapping cleanly onto Elasticsearch shards. He then showed a demo in which UK road accident data was imported into Spark from CSV files, indexed automatically in Elasticsearch, and then queried both through Elasticsearch and via Spark’s SQL-like facility. Interestingly, this allows a powerful combination of free-text search and relational JOINs to be applied to data in a highly scalable fashion – Spark also features machine learning and streaming data components.
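The appeal of the combination is that one system can answer both free-text and relational questions over the same data. A stdlib-only miniature of the idea (the accident rows and severity lookup are invented; in the demo this was Spark SQL over Elasticsearch-indexed data):

```python
import csv
import io

# Invented miniature of the road-accident demo: a CSV of accidents plus a
# code-to-label lookup table, combined with a free-text filter (what a
# search engine does) and a relational join (what SQL does).
ACCIDENTS_CSV = """id,description,severity_code
1,collision on wet road near junction,2
2,vehicle left road in fog,3
3,collision at roundabout,1
"""

SEVERITY = {"1": "fatal", "2": "serious", "3": "slight"}

def search_and_join(csv_text, keyword):
    """Return (id, severity label) for rows whose description matches the
    keyword: a text search joined against a lookup table."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [(r["id"], SEVERITY[r["severity_code"]])
            for r in rows if keyword in r["description"]]
```

Spark and Elasticsearch do exactly this shape of work, but partitioned across a cluster instead of in a single loop.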

After a quick plug for Elastic{ON} in London in November, Matt Jones of JustEat described how they have used Elasticsearch’s geolocation search features to improve their handling of restaurant delivery areas. Their previous system only handled the first part of a postcode (e.g. ‘SE1’), and they needed finer-grained control over the areas restaurants can deliver to. By indexing polygons representing UK postcode areas and combining these with custom shapes (e.g. a circle representing a maximum delivery distance) they have created a powerful and extendable way to restrict search results. Matt has blogged about this in more detail.
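Under the hood, this kind of query comes down to point-in-shape tests against the indexed polygons. A minimal ray-casting sketch (the coordinates and the square "delivery area" are invented; Elasticsearch’s geo queries do the real work against spatially indexed shapes at scale):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray from the point to the right and count
    how many polygon edges it crosses; an odd count means the point is
    inside. polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

# Invented delivery area: a simple square standing in for a postcode polygon.
AREA = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```

A delivery-area search then reduces to: find all restaurants whose stored shape contains the customer’s location.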

The last talk was by Tom Green of Couchbase, who described how this powerful NoSQL platform is architected and how it can be connected directly to Elasticsearch using its Cross Data Centre Replication (XDCR) feature. We finished with the usual Q&A, during which Mark Harwood responded to my question on exact facet counts in Elasticsearch with a plea to the industry to be more honest about the limitations of distributed systems – much like the CAP theorem, perhaps we need a similar triangle with vertices of Big Data, Speed and Accuracy: pick two!

Thanks as ever to all the speakers and the hosts, and to Yann Cluchey for organising the Meetup.
