Search Results for “luwak” – Flax, The Open Source Search Specialists (http://www.flax.co.uk)

Activate 2018 day 1 – AI and Search in Montreal (Tue, 30 Oct 2018)
http://www.flax.co.uk/blog/2018/10/30/activate-2018-day-1-ai-and-search-in-montreal/

Activate is the successor to the Lucene/Solr Revolution conference that our partner Lucidworks runs every autumn, and was held this year in Montreal, Canada. After running a successful Lucene Hackday on the Monday before the conference, we joined hundreds of others to hear Will Hayes, the CEO of Lucidworks, explain the new name and direction of the event – it was nice to hear he agrees with me that search is the key to AI. Yoshua Bengio of local AI laboratory MILA followed Will, describing recent breakthroughs in speech and image recognition before moving on to Creative AI, which can ‘imagine’ new faces after sufficient training. He listed five necessary ingredients for successful machine learning: lots of data, flexible models, enough compute power, computationally efficient inference and powerful prior assumptions to deflect the ‘curse of dimensionality’. These are hard to get right – he told us how even cutting-edge AI is still far from human-level intelligence, but can be used to extend human cognitive power. MILA is the world’s largest concentration of academics working in deep learning and is heavily funded by the Canadian government.

I was also pleased to notice our Luwak stored search library mentioned in the handout Bloomberg had placed on every seat!

The talks I attended after the keynote were generally focused on open source, Solr or search topics, but the theme of AI was everywhere. The first talk I went to was about Accenture’s Content Analytics Studio – which looks like a useful tool for building search and analytics applications using a library of widgets and a Python code editor. Unfortunately it wasn’t very clear how one might use this platform, and the presenter eventually admitted that it was a proprietary product without giving any idea of the price or business model. I would much prefer it if presenters were up-front about commercial products, especially as many attendees were from an open source background.

David Smiley’s talk on Querying Hundreds of Fields at Scale was a lot more interesting: he described how Salesforce run millions of Solr cores and index extremely diverse customer data (as each customer can customise their field structure). Using the usual Solr qf operator across possibly 150 fields can lead to thousands of subqueries being generated, which also need to be run across each segment. His approach to optimising performance included analysing the input data per field type rather than per field, building a custom segment merge policy and encoding the field type as a term suffix in the term dictionary. Although this uses more CPU time, it improves performance by at least a factor of 10. David hopes to contribute some of this work back to Solr as open source, although much is specific to Salesforce’s use case. This was a fascinating talk about some very clever low-level Lucene techniques.


Next was my favourite talk of the conference – Kevin Watters on the Intersection of Robotics, Search & AI, featuring a completely 3D-printed humanoid robot based on the open source InMoov platform and MyRobotLab software. Kevin has used hundreds of open source projects to add capabilities such as speech recognition, question answering (based on Wikipedia), computer vision and deep learning, using a pub/sub architecture. The robot’s ‘memory’ – everything it does, sees and hears, and how the various modules interact – is stored in a Solr index. Kevin’s engaging talk showed us examples of how the robot’s search-engine-powered memory can be used for deep learning, for example for image recognition – in his demo it could be trained to recognise pictures of some Solr committers. This really was the crossover between search and AI!

Joel Bernstein then took us through Applied Mathematical Modelling with Apache Solr – describing the ongoing work to integrate the Apache Commons Math library. In particular he showed how these new features can be used for anomaly detection (e.g. an unusually slow network connection) using a simple linear regression model. Solr’s Streaming API can be used to run a constant prediction of the likely response times for sending files of a certain size and any statistically significant differences noted. This is just one example of the powerful features now available for Solr-based analytics – there was more to come in Amrit Sarkar‘s talk afterwards on Building Analytics Applications with Streaming Expressions. Amrit showed a demo (code available here) using Apache Zeppelin where Solr’s various SQL-style operations can be run in parallel for better performance, splitting the job up over a number of worker collections. As the demo imported data directly from a database using a JDBC connector, some of us in the room wondered whether this might be a higher-performing alternative to the venerable (and slow) Data Import Handler…
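Joel’s anomaly-detection example can be sketched outside Solr: fit a linear model of response time against file size, then flag observations whose residual from the fitted line is unusually large. A toy illustration of the underlying maths in Python – this is not Solr’s Streaming API, and the data and threshold are invented for the example:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit, returning (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def anomalies(xs, ys, threshold=2.0):
    """Indices of points whose residual from the fitted line exceeds
    `threshold` standard deviations of all the residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sd]

# Invented data: file sizes (MB) vs transfer times (s); the last
# transfer is suspiciously slow for its size.
sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
times = [1.1, 2.0, 3.1, 3.9, 5.2, 6.0, 7.1, 8.0, 9.1, 30.0]
```

Solr’s Streaming API runs this kind of model continuously over indexed data rather than as a one-off batch calculation, but the statistical idea is the same.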

That was the last talk I saw on Wednesday: that evening was the conference party in a nearby bar, which was a lot of fun (although the massive TV screen showing that night’s hockey game was a little distracting!). I’ll write about day 2 soon: videos of the talks should appear on Lucidworks’ YouTube channel and I’ll update this post when they do.

Search Solutions 2017 review (Thu, 14 Dec 2017)
http://www.flax.co.uk/blog/2017/12/14/search-solutions-2017/

Search Solutions is one of my favourite search events of the year – small, focused and varied, with presentations from both the largest and smallest players in the world of search, drawn from both industry and academia.

This year’s event started with Edgar Meij of Bloomberg, who Flax have helped in the past with their large-scale search and alerting systems. I’d seen most of the details in this talk before so I won’t dwell on them but will thank Bloomberg again for their commitment and contributions to the open source community, particularly to Solr and our Luwak stored search library. Mark Fea of LexisNexis was up next with a talk about taxonomies and how they have built a semi-automated classification system combining supervised machine learning and Boolean rules-based systems: a pragmatic approach to combine the strengths of both approaches as machine learning isn’t always as clever as one might want, and Boolean rules can be hard to build and maintain. Like Bloomberg they are working at large scale: Mark mentioned taxonomies of 21,000 terms and 9 levels, applied to over 1 billion documents.

Mark Harwood of Elastic was up next with one of his always fascinating talks on discovering unknown patterns in data with Elasticsearch. He showed how he had explored ‘toxic’ content (far-right music and those who like it) and fake reviews on Amazon with some great visual demonstrations. An interesting conclusion was how ‘bad actors’ make strange, recognisable shapes in visualised data. [Mark later won the Best Presentation award, richly deserved!]. Anna Kolliakou of King’s College London spoke next on ‘veracity intelligence’ tools to help monitor terms connected to mental health across news media and social networks: an interesting example was ‘mephedrone’ around the time of reclassification of this particular recreational drug. Next up was independent consultant Phil Bradley with a detailed, well-researched and passionate talk on fake news and how one cannot trust any web search engine to present the full picture. Phil is obviously extremely concerned about this issue and his talk spurred discussion amongst the audience about how user education is essential to counter the usual viewpoint of ‘it’s on Google, it must be true’.

Coincidentally, Filip Radlinski of Google started the next session, describing a model for conversational information retrieval. He spoke about how the user and IR system reveal information about themselves as the conversation progresses, how the system may need a memory of past interactions and how it may present a set of potential answers. This is a useful model for the future, although most current ‘conversational’ systems are simplistic. Fabrizio Silvestri then spoke on the various types of search Facebook provides, mostly related to finding people but also images, video and news. He explained how every search operation needs to consider privacy and how Facebook use query rewriting to expand and enhance the terms provided by the user. Nicola Cancedda of Microsoft was next with a talk on automated query extraction from emails, to help the user find and attach relevant documents in response (for example, after a colleague asks ‘can you send me the cost projections for 2017’). Her work involves training machine learning models after extracting candidate terms with high TF/IDF values from the email. [Interestingly, this reminded me of work I carried out nearly 20 years ago on an email signature that, when clicked, would search for content relevant to the email – although this relied on Javascript working in an email client, which is rather a security problem!]

Last of our scheduled talks was from Mark Stanger of Search Technologies (recently acquired by Accenture) about their work on Elsevier’s DataSearch platform. He described how they developed a Phrase Service that identifies phrases in the user’s query using various methods including acronym detection, dictionary lookup and natural language processing, then expands these phrases as necessary to provide enhanced search. After identifying these key terms they can be boosted appropriately for search (DataSearch itself is based on Solr).

The DataSearch project is impressive, and later on it won the Best Search Project award (I am proud to say I served on the judging panel for these awards this year). The other winner, of the Most Promising Search Startup award, was Search|hub by CXP Commerce Experts GmbH.

We finished with some lightning talks and a brief Fishbowl session, dominated this time by discussions on Fake News and how it affects the world of search technology. Thanks to the BCS IRSG again for a fascinating and enlightening day.

Worth the wait – Apache Kafka hits 1.0 release (Thu, 02 Nov 2017)
http://www.flax.co.uk/blog/2017/11/02/worth-wait-apache-kafka-hits-1-0-release/

We’ve known about Apache Kafka for several years now – we first encountered it when we developed a prototype streaming Boolean search engine for media monitoring with our own library Luwak. Kafka is a distributed streaming platform with some simple but powerful concepts – everything it deals with is a stream of data (like a messaging system), streams can be combined for processing and stored reliably in a highly fault-tolerant way. It’s also massively scalable.

For search applications, Kafka is a great choice for the ‘wiring’ between source data (databases, crawlers, flat files, feeds) and the search index and other parts of the system. We’ve used other message passing systems (like RabbitMQ) in projects before, but none have the simplicity and power of Kafka. Combine the search index with analysis and visualisation tools such as Kibana and you can build scalable, real-time systems for ingesting, storing, searching and analysing huge volumes of data – for example, we’ve already done this for clients in the financial sector wanting to monitor log data using open-source technology, rather than commercial tools such as Splunk.
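To illustrate this ‘wiring’ role, here is a minimal sketch of a Kafka consumer feeding a search index, assuming the kafka-python and elasticsearch client libraries; the topic, index and host names are invented for the example:

```python
import json

def to_bulk_actions(messages, index="articles"):
    """Turn raw Kafka message values (JSON bytes) into Elasticsearch
    bulk-indexing actions. Kept as a pure function so it can be
    exercised without either service running."""
    for value in messages:
        doc = json.loads(value)
        yield {"_index": index, "_id": doc["id"], "_source": doc}

def run_pipeline():
    """Wire a Kafka topic to a search index. Not invoked here – it
    needs live Kafka and Elasticsearch services on the invented hosts."""
    from kafka import KafkaConsumer                # pip install kafka-python
    from elasticsearch import Elasticsearch, helpers

    consumer = KafkaConsumer("articles", bootstrap_servers="localhost:9092")
    es = Elasticsearch("http://localhost:9200")
    for record in consumer:                        # blocks, one record at a time
        helpers.bulk(es, to_bulk_actions([record.value]))
```

In a real deployment you would batch records and handle failures, but the shape – a durable stream on one side, a bulk indexer on the other – is the point.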

The development of Kafka has been masterminded by our partners Confluent, and it’s a testament to this careful management that the milestone 1.0 version has only just appeared. This doesn’t mean that previous versions weren’t production ready – far from it – but it’s a sign that Kafka has now matured to be a truly enterprise-scale project. Congratulations to all the Kafka team for this great achievement.

We look forward to working more with this great software – and if you need help with your Kafka project do get in touch!

Elastic London Meetup: Rightmove & Signal Media and a new free security plugin for Elasticsearch (Thu, 28 Sep 2017)
http://www.flax.co.uk/blog/2017/09/28/elastic-london-meetup-rightmove-signal-media-new-free-security-plugin-elasticsearch/

I finally made it to a London Elastic Meetup again after missing a few of the recent events: this time Rightmove were the hosts and the first speakers. They described how they had used Elasticsearch Percolator to run 3.5 million stored searches on new property listings as part of an overall migration from the Exalead search engine and Oracle database to a new stack based on Elasticsearch, Apache Kafka and CouchDB. After creating a proof-of-concept system on Amazon’s cloud they discovered that simply running all 3.5m Percolator queries every time a new property appeared would be too slow, and thus implemented a series of filters to cut down the number of queries applied, including filtering out rental properties and those in the wrong location. They are now running around 40m saved searches per day and also plan to upgrade from their current Elasticsearch 2.4 system to the newer version 5, as well as carry out further performance improvements. After the talk I chatted to the presenter George Theofanous about our work for Bloomberg using our own library Luwak, which could be a way for Rightmove to run stored searches much more efficiently.
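For a sense of how stored searches of this kind work, here is a hedged sketch of the request bodies used by the modern Elasticsearch percolate query (the API differed in the 2.x version Rightmove mention, and the field names here are invented for the example): a saved search is indexed as a document in a field mapped with type `percolator`, and each new listing is then sent as a document to find which saved searches match it.

```python
def stored_search(criteria):
    """Body for registering a saved search: the buyer's criteria are
    stored as a query document in a percolator-mapped field."""
    clauses = []
    for field, value in criteria.items():
        if isinstance(value, dict):
            clauses.append({"range": {field: value}})   # e.g. price bounds
        else:
            clauses.append({"term": {field: value}})
    return {"query": {"bool": {"filter": clauses}}}

def percolate_request(listing):
    """Body asking: which stored searches match this new listing?"""
    return {"query": {"percolate": {"field": "query", "document": listing}}}

search = stored_search({"location": "cambridge", "price": {"lte": 350000}})
request = percolate_request({"location": "cambridge", "price": 325000})
```

Pre-filtering, as Rightmove describe, simply reduces how many of these stored queries are considered for each incoming document.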

Next up was Signal Media, describing how they built an automated system for upgrading Elasticsearch after their cluster grew to over 60 nodes (they ingest a million articles a day, and up to May 2016 were running on Elasticsearch 1.5, which had a number of issues with stability and performance). To avoid having to completely shut down and upgrade their cluster, Joachim Draeger described how they carried out major version upgrades by creating a new, parallel cluster (he named this the ‘blue/green’ method), with their indexing pipeline supplying both clusters and their UI code being gradually switched over to the new cluster once stability and performance were verified. This process has cut their cluster to only 23 nodes, with a 50% cost saving and many performance and stability benefits. For ongoing minor version changes they have built an automated rolling upgrade system using two Amazon EBS volumes for each node (one holds the system, and is simply switched off as a node is disabled; the other holds the data, and is re-attached to a new node once it is created with the upgraded Elasticsearch machine image). With careful monitoring of cluster stability and (of course) testing, this system enables them to upgrade their entire production cluster in a safe and reliable way without affecting their customers.

After the talks I announced the Search Industry Awards I’ll be helping to judge in November (please apply if you have a suitable search project or innovation!) and then spoke to Simone Scarduzio about his free Elasticsearch and Kibana security plugin, a great alternative to the Elastic X-Pack (only available to Elastic subscription customers). We’ll certainly be taking a deeper look at this plugin for our own clients.

Thanks again to Yann Cluchey for organising the event and all the speakers and hosts.

ECIR 2017 Industry Day, our book & a demo of live TV factchecking (Mon, 24 Apr 2017)
http://www.flax.co.uk/blog/2017/04/24/ecir-2017-industry-day-book-demo-live-tv-factchecking/

I visited Aberdeen before Easter to speak at Industry Day, a part of the European Conference on Information Retrieval. Following a reception at Aberdeen’s Town House (a wonderful building) hosted by the Lord Provost I spent an evening with various information retrieval luminaries including Professor Udo Kruschwitz of the University of Essex. We had a chance to discuss the book we’re co-authoring (draft title ‘Searching the Enterprise’, designed as a review of the subject for those considering a PhD or those in business wanting to know the current state of the art – it should be out later this year) and I also caught up with our associate Tony Russell-Rose of UXLabs.

Industry Day started with a talk by Peter Mika of Norwegian media group Schibsted on modelling user behaviour for delivering personalised news. It was interesting to hear his views on Facebook and the recent controversy about their removal of a photo posted by a Schibsted group newspaper, and how this might be a reason Schibsted carry out their own internal developments rather than relying on the algorithms used by much larger companies. Edgar Meij was up next talking about search at Bloomberg (which we’ve been involved in) and it was interesting to hear that they might be contributing some of their alerting infrastructure back to Apache Lucene/Solr. James McMinn of startup Scoop Analytics followed, talking about real-time news monitoring. They have built a prototype system based on PostgreSQL rather than a search engine, indexing around half a billion tweets, that allows one to spot breaking news much earlier than the main news outlets might report it.

The next session started with Michaela Regneri of OTTO on Newsleak.io, a project in collaboration with Der Spiegel “producing a piece of software that allows to quickly and intuitively explore large amounts of textual data”. She stressed how important it is to have a common view of what is ‘good’ performance in collaborative projects like this. Richard Boulton (who worked at Flax many years ago) was next in his role as Head of Software Engineering at the Government Digital Service, talking about the ambitious project to create a taxonomy for all UK government content. So far, his team have managed to create an alpha version of this for educational content – but they don’t have the time or resources in-house to tag content themselves, so must work with the relevant departments to do so. They have created various software tools to help, including an automatic topic tagger using Latent Dirichlet Allocation – which, given this is the GDS, is of course open source and available.

Unfortunately I missed a session after this due to a phone call, but managed to catch some of Elizabeth Daly of IBM talking about automatic claim detection using the Watson framework. Using Wikipedia as a source, this can identify statements that support a particular claim for an argument and tag them as ‘pro’ or ‘con’. This topic led neatly on to Will Moy of Full Fact, who we have been working with recently, in a ‘sandwich’ session with myself. Will talked about how Full Fact has been working for many years to develop neutral, un-biased factchecking tools and services, and I then spoke about the hackday we ran recently for Full Fact and particularly about our Luwak library and how it can be used to spot known claims by politicians in streaming news. Will then surprised me and impressed the audience by showing a prototype service that watches several UK television channels in real time, extracts the subtitles and checks them against a list of previously factchecked claims – using the Luwak backend we built at the hackday. Yes, that’s live factchecking of television news, very exciting!

Thanks to Professor Kruschwitz and Tony Russell-Rose for putting together the agenda and inviting both me and Will to speak – it was great to be able to talk about the exciting work we’re doing with Full Fact and to hear about the other projects.

A fabulous FactHack for Full Fact (Fri, 27 Jan 2017)
http://www.flax.co.uk/blog/2017/01/27/fabulous-facthack-full-fact/

Last week we ran a hackday for Full Fact, hosted by Facebook in their London office. We had planned to gather a room full of search experts from our London Lucene/Solr Meetup, and around twenty people attended from a range of companies including Bloomberg, Alfresco and the European Bioinformatics Institute, among them a number of Lucene/Solr committers.

Mevan Babakar of Full Fact has already written a detailed review of the day, but to summarise we worked on three areas:

  • Building a web service around our Luwak stored query engine, to give it an easy-to-use API. We now have an early version of this which allows Full Fact to check claims they have previously fact checked against a stream of incoming data (e.g. subtitles or transcripts of political events).
  • Creating a way to extract numbers from text and turn them into a consistent form (e.g. ‘eleven percent’, ‘11%’, ‘0.11’) so that we can use range queries more easily – Derek Jones’ team researched existing solutions and he has blogged about what they achieved.
  • Investigating how to use natural language processing to identify parts of speech and tag them in a Lucene index using synonyms and token stacking, to allow for queries such as ‘<noun> is rising’ to match text like ‘crime is rising’ – the team forked Lucene/Solr to experiment with this.
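Derek’s write-up has the details of what his team actually found and built; as a toy illustration of the normalisation problem in the second item above, the following sketch maps a few surface forms of a percentage to one canonical value (the word list and patterns are invented for the example, and a real system would handle far more forms):

```python
import re

# Minimal number words for the sketch; a real normaliser needs a full
# grammar for compound numbers ('twenty-one', 'three hundred', ...).
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12}

def normalise(text):
    """Return a fraction for expressions like '11%', 'eleven percent'
    or '0.11'; None if the form is not recognised."""
    text = text.strip().lower()
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(%|percent)", text)
    if m:
        return float(m.group(1)) / 100
    m = re.fullmatch(r"(\w+)\s+percent", text)
    if m and m.group(1) in WORDS:
        return WORDS[m.group(1)] / 100
    if re.fullmatch(r"\d+(?:\.\d+)?", text):
        return float(text)
    return None
```

Once every form is reduced to the same canonical number, a range query such as ‘claims between 10% and 15%’ can match all of them.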

We’re hoping to build on these achievements to continue to support Full Fact as they develop open source automated fact checking tools for both their own operations and for other fact checking organisations across the world (there were fact checkers from Argentina and Africa attending to give us an international perspective). Our thanks to all of those who contributed.

I’ve also introduced Full Fact to many others within the search and text analytics community and we would welcome further contributions from anyone who can lend their expertise and time – get in touch if you can help. This is only the beginning!

Just the facts with Solr & Luwak (Wed, 04 Jan 2017)
http://www.flax.co.uk/blog/2017/01/04/just-facts-solr-luwak/

It won’t have escaped your notice that factchecking is very much in the news recently due to last year’s political upheavals in both the US and UK and the suspected influence of fake news on voters. Both traditional and social media organisations are making efforts in this area; examples include Channel 4 and Facebook.

At our recent London Lucene/Solr Meetup, UK charity Full Fact spoke eloquently on the need for automated factchecking tools to help identify and correct stories that are demonstrably false. They’ve also published a great report on The State of Automated Factchecking which mentions both Apache Solr and our powerful stored query library Luwak as components of their platform. We’ve been helping Full Fact with their prototype factchecking tools for a while now, but during the Meetup I suggested we might run a hackday to develop these further.

Thus I’m very pleased to announce that Facebook have offered us a venue in London for the hackday on January 20th (register here). Many Solr developers, including several committers and PMC members, are signed up to attend already. We’ll use Full Fact’s report and their experiences of factchecking newspapers, TV’s Question Time and Hansard to design and build practical, useful tools and identify a future roadmap. We’ll aim to publish what we build as open source software which should also benefit factchecking organisations across the world.

If you’re concerned about the impact of fake news on the political process and want to help, join the Meetup and/or donate to Full Fact.

Meetup at Big Data London – One-click Solr & Factchecking with Solr (Thu, 10 Nov 2016)
http://www.flax.co.uk/blog/2016/11/10/meetup-big-data-london-one-click-solr-factchecking-solr/

Last week I spoke at the Big Data London conference, a very busy event with several thousand people attending. My session was on using open source search to make sense of Big Data – you can get slides here.

In the evening we ran another Lucene/Solr London Usergroup event with speakers Upayavira and Full Fact. After a brief but friendly fight with the Datastax team over pizza we settled down to see Upayavira show us his method for creating a fully functional SolrCloud stack and search application with a single command line using tools such as Docker, Rancher and Exhibitor. Upayavira’s system only needs to be given details of an Amazon Web Services cloud hosting account and it will create host instances, install and start Zookeeper, wait for a quorum to be established, install and start Solr and create a SolrCloud cluster and finally install and start a search application. The whole thing is managed by his own script Uberstack and is undeniably impressive.

Our second talk (and I think my favourite talk from all our Solr Meetups) was from Will Moy and Mevan Babakar of Full Fact, a charity who monitor the news for accuracy (something we increasingly require in these ‘post-truth’ days). Will told us how false and misleading claims can be amplified by the media and may end up directly influencing government policy, even though the underlying facts are wrong. Full Fact are attempting to build open source, freely available systems for automating the factchecking process using Apache Lucene/Solr and our own stored query library Luwak, and Flax have been donating some time to help them with this process. Their Hawk system currently indexes over 70 million sentences. This project is a wonderful example of how free, open source software can be used to create tools that benefit us all, and at the end of this inspiring talk many of the audience offered ideas and even direct assistance with the project. I urge you to read Full Fact’s recent report on automated factchecking and get involved if you can. One idea was to run a Hackday for Full Fact – more details when we have them.

Thanks to Big Data London for inviting me to speak and hosting the Meetup and to Elsevier for sponsoring pizza and drinks. We’ll be back with another Meetup soon!

Out with the old – and in with the new Lucene query parser? (Fri, 13 May 2016)
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

Over the years we’ve dealt with quite a few migration projects where the query syntax of the client’s existing search engine must be preserved. This might be because other systems (or users) depend on it, or a large number of stored expressions exist and it is difficult or uneconomic to translate them all by hand. Our usual approach is to write a query parser, which understands the current syntax but creates a query suitable for a modern open source search engine based on Apache Lucene. We’ve done this for legacy engines including dtSearch and Verity and also for in-house query languages developed by clients themselves. This allows you to keep the existing syntax but improve performance, scalability and accuracy of your search engine.

There are a few points to note during this process:

  • What appears to be a simple query in your current language may not translate to a simple Lucene query, which may lead to performance issues if you are not careful. Wildcards for example can be very expensive to process.
  • You cannot guarantee that the new search system will return exactly the same results, in the same order, as the old one, no matter how carefully the query parser is designed. After all, the underlying search engine algorithms are different.
  • Some element of manual translation may be necessary for particularly large, complex or unusual queries, especially if the original intention of the person who wrote the query is unclear.
  • You may want to create a vendor-neutral query language as an intermediate step – so you can migrate more easily next time. We’ve done this for Danish media monitors Infomedia.
  • If you have particularly large and/or complex queries that may have been added to incrementally over time, they may contain errors or logical inconsistencies – which your current engine may not be telling you about! If you find these you have two choices: fix the query expression (which may then give you slightly different results) or make the new system give the same (incorrect) results as before.
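
To illustrate the vendor-neutral intermediate step: parse the legacy syntax once into a neutral query tree, then write one renderer per target engine. A minimal sketch, with node shapes invented for illustration:

```python
# A neutral query tree rendered for one target engine (Lucene).
# Migrating to a different engine later only requires a new
# renderer; the stored trees stay the same.

def render_lucene(node):
    """Render a neutral query tree as a Lucene query string."""
    if node["type"] == "term":
        return node["value"]
    if node["type"] == "bool":
        joined = f' {node["op"]} '.join(render_lucene(c) for c in node["children"])
        return f"({joined})"
    raise ValueError(f"unknown node type: {node['type']}")

tree = {"type": "bool", "op": "AND", "children": [
    {"type": "term", "value": "apple"},
    {"type": "bool", "op": "OR", "children": [
        {"type": "term", "value": "banana"},
        {"type": "term", "value": "cherry"}]}]}

print(render_lucene(tree))  # -> (apple AND (banana OR cherry))
```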

To mitigate these issues it is important to decide on a test set of queries and expected results, and what level of ‘correctness’ is required – bearing in mind 100% is going to be difficult if not impossible. If you are dealing with languages outside the experience of the team you should also make sure you have access to a native speaker – so you can be sure that results really are relevant!
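
One simple way to put a number on that ‘level of correctness’ is to measure, for each test query, how much of the old engine’s top-N results the new engine reproduces. A minimal sketch (a real evaluation might also weight rank order or use graded relevance judgements):

```python
# Compare old and new engines on a test query by the overlap of
# their top-n result ids. 1.0 means the new engine returned every
# one of the old engine's top-n results (in any order).

def overlap_at_n(old_results, new_results, n=10):
    """Fraction of the old engine's top-n results also in the new top-n."""
    old_top, new_top = set(old_results[:n]), set(new_results[:n])
    if not old_top:
        return 1.0  # nothing to reproduce
    return len(old_top & new_top) / len(old_top)

old = ["doc1", "doc2", "doc3", "doc4"]
new = ["doc2", "doc1", "doc5", "doc3"]
print(overlap_at_n(old, new))  # -> 0.75
```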

Do let us know if you’re planning this kind of migration and how we can help – building Lucene query parsers is not a simple task and some past experience can be invaluable.

Apache Kafka London Meetup – Real time search and insights http://www.flax.co.uk/blog/2016/04/14/apache-kafka-london-meetup-real-time-search-insights/ Thu, 14 Apr 2016 09:50:05 +0000

The rise of Apache Kafka as a streaming data solution is something we’ve been watching for a while – as part of a collection of Big Data tools, it provides a ‘TiVo for data‘ feature. We’ve begun to use it in client projects covering both search and log analysis, and we’ve recently partnered with Confluent, the company founded by the creators of Kafka.

Last night we spoke at the Apache Kafka London Meetup – hosted by British Gas Connected Homes, it was well supplied with drinks, pizza and snacks and also very well attended – there was a great buzz of conversation before the talks had even started! Alan Woodward of Flax started with an updated talk about our proof-of-concept integration of Kafka, Apache Samza and our own Luwak streaming search library (slides are available here). This allows full-text search within a Kafka stream, with the search queries supplied as another stream, for a truly real-time solution – as opposed to the more usual (and much higher latency) approach of indexing the endpoint of a stream. Alan has also tried the very new Kafka Streams feature which can be used as an alternative to Apache Samza – there is some very early code available, although note that this still needs some work! (We’ll update this blog when it’s finished).
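
The shape of that integration – one stream carrying search queries that update the monitor, another carrying documents that are matched as they arrive – can be reduced to an in-memory sketch. In the real system both are Kafka topics and the matching is done by the Java Luwak library; everything below is invented for illustration:

```python
# In-memory sketch of streaming search: queries register themselves,
# then each arriving document is matched against all current queries
# immediately, rather than being indexed and queried later.

def streaming_search(query_stream, document_stream):
    """Consume queries then documents; yield (doc_id, matched query ids)."""
    registered = {}                      # query_id -> set of required terms
    for qid, terms in query_stream:      # queries arrive up front here;
        registered[qid] = set(terms)     # on Kafka they would interleave
    for doc_id, text in document_stream:
        tokens = set(text.lower().split())
        hits = sorted(q for q, t in registered.items() if t <= tokens)
        yield doc_id, hits

queries = [("q1", {"kafka", "samza"}), ("q2", {"solr"})]
docs = [("d1", "integrating kafka with samza for streaming search"),
        ("d2", "a post about elasticsearch")]
print(list(streaming_search(queries, docs)))
# -> [('d1', ['q1']), ('d2', [])]
```

The key property is latency: a match is reported the moment the document arrives on the stream, not after the next index commit.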

The second talk was by one of our hosts, Josep Casals, on how British Gas have used Kafka, Spark Streaming and Apache Cassandra to build a platform for analysing data from smart meters, boilers and thermostats. Over 2 million smart meters are installed across the UK and there are also over 300,000 connected thermostats, plus many other data sources; these devices can report every 30 minutes and 2 minutes respectively, so the system has to cope with around 30,000 messages/second. One interesting feature for me was how machine learning is used to disaggregate power consumption data, so that the consumption of, say, a fridge can be split out from the overall figure. Apache Samza is also used in this system to provide estimates of consumption and to interpolate between readings, allowing data to be fed back to an app on the customer’s mobile device. Further use cases include spotting outlier events, which might indicate failing heating devices or even unusual patterns in an elderly person’s home to alert relatives or carers.
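
Interpolating between periodic readings is a nice, concrete example of that kind of stream processing. In the British Gas system this runs as a streaming job; as a standalone sketch (the function name and data layout are invented for illustration), simple linear interpolation looks like:

```python
# Estimate a meter value at an arbitrary time from the surrounding
# periodic readings, using plain linear interpolation.

def interpolate_reading(readings, t):
    """Linearly interpolate a meter value at time t from (time, value) pairs."""
    readings = sorted(readings)
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside the range of available readings")

# Half-hourly smart meter readings: (minutes, cumulative kWh)
readings = [(0, 100.0), (30, 101.5), (60, 103.5)]
print(interpolate_reading(readings, 45))  # -> 102.5
```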

Both talks were live streamed and you can watch them here.

We concluded with some informal discussion and a chance to meet some of Confluent’s UK-based team. Thanks to the organisers and hosts and we look forward to returning! If you have a Kafka project and you’d like any help or advice, do let us know.
