Flax – The Open Source Search Specialists
http://www.flax.co.uk


Flax joins OpenSource Connections
Fri, 21 Dec 2018
http://www.flax.co.uk/blog/2018/12/21/flax-joins-opensource-connections/

We have some news!

From February 1st 2019 Flax’s Managing Director Charlie Hull will be joining OpenSource Connections (OSC), Flax’s long-standing US partner, as a senior Managing Consultant. Charlie will manage a new UK division of OSC, which will also acquire some of Flax’s assets and brands. OSC are highly regarded in the world of search and relevance: they wrote the seminal book Relevant Search and run the popular Haystack relevance conference. Their clients include the US Patent Office, the Wikimedia Foundation and Under Armour, and their services include comprehensive training, Discovery engagements, Trusted Advisor consulting and expert implementation.

Lemur Consulting Ltd., which as most of you will know trades as Flax, will continue to operate and to complete current projects but will not be taking on any new business after January 2019. We will forward all new enquiries to OSC, where Charlie will as ever be very happy to discuss requirements and how OSC’s expert team (which may include some familiar faces!) might help.

We are all very excited about this new development as it will create a larger team of independent search & relevance experts with a global reach. We fully expect to build on Flax’s 17-year history of providing high quality search solutions as part of OSC. We intend to continue managing the London Lucene/Solr Meetup and running, attending and speaking at other events on search related topics.

If you have any questions about the above please do contact us. Merry Christmas and best wishes for the New Year!

Activate 2018 day 2 – AI and Search in Montreal
Wed, 07 Nov 2018
http://www.flax.co.uk/blog/2018/11/07/activate-2018-day-2-ai-and-search-in-montreal/

I’ve already written about Day 1 of Lucidworks’ Activate conference; the second day started with a keynote on ‘moral code’, ethics & AI which unfortunately I missed, but a colleague reported that it was very encouraging to see topics such as diversity and inclusion raised in a keynote talk. Note that videos of some of the talks are starting to appear on Lucidworks’ Youtube channel.

Steve Rowe of Lucidworks gave a talk on what’s coming in Lucene/Solr 8 – a long list of improvements and new features from 7.x releases including autoscaling of SolrCloud clusters, better cross-datacentre replication (CDCR), time routed index aliases for time-series data, new replica types, streaming expressions, a JSON query DSL and better segment merge policies – it’s clear that a huge amount of work continues to go into Solr. In 8.x releases we’ll hopefully see HTTP/2 capability for faster throughput and perhaps Luke, the Lucene Index Toolbox, becoming part of the main project.
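For those who haven’t tried it, the JSON query DSL lets you express a whole Solr request as a structured JSON body rather than a string of URL parameters. A minimal sketch (the collection and field names here are invented – check the Solr Reference Guide for your version):

```
POST /solr/articles/select
{
  "query": "title:lucene",
  "filter": ["inStock:true"],
  "limit": 10,
  "facet": {
    "categories": { "type": "terms", "field": "cat" }
  }
}
```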

Cassandra Targett, also of Lucidworks, spoke about the Lucene/Solr Reference Guide which is now actually part of Solr’s source code in Asciidoc format. She had attempted to build this into a searchable, fully-hyperlinked documentation source using Solr itself but this quickly ran into issues with HTML tags and maintaining correct links. Lucidworks’ own Site Search did a lot better but the result still wasn’t perfect. Work remains to be done here but encouragingly in the last few weeks there’s also been some thinking about how to better document Solr’s huge and complex test suite on SOLR-12930. As Cassandra mentioned, effective documentation isn’t always the focus of Solr committers, but it’s essential for Solr users.

The next talk I caught came from Andrzej Bialecki on Solr’s autoscaling functionality and some impressive testing he’s done. Autoscaling analyzes your Solr cluster and makes suggestions about how to restructure it – which you can then do manually or automatically using other Solr features. These features are generally tested on collections of 1 billion documents – but Andrzej has manually tested them on 1 trillion simulated documents (yes, you read that right). Now that’s some scale!

The final talk I caught before the closing keynote was Chris ‘Hossman’ Hostetter on How to be a Solr Contributor, amusingly peppered with profanity as is his usual style. There were a number of us in the room with some small concerns about Solr patches that have not been committed, and in general about how Solr might need more committers and how this might happen, but the talk mainly focused on how to generate new patches. He also mentioned how new features can have an unexpected cost, as they must then be maintained and might have totally unexpected consequences for other parts of the platform. Some of the audience raised questions about Solr tests (some of which regularly fail) – however since the conference Mark Miller has taken the lead on this under SOLR-12801 which is encouraging.

The closing keynote by Trey Grainger brought together the threads of search and AI – and also mentioned that if anyone had some spare server capacity, it would be fun to properly test Solr at trillion-document scale…

So in conclusion how did Activate compare to its previous incarnation as Lucene/Solr Revolution? Is search really the foundation of AI? Well, the talks I attended mainly focused on Solr features, but various colleagues heard about machine learning, learning-to-rank and self-aware machines, all of which are becoming easier to implement using Lucene/Solr. However, as Doug Turnbull writes, if you’re thinking of AI for search you should be wary of the potential cost and complexity. There are no magic robots (Kevin Watters’ robot, however, is rather wonderful!).

Huge thanks must go to all at Lucidworks for putting on such a well-organised and thought-provoking event and bringing together so many Lucene/Solr enthusiasts.

Activate 2018 day 1 – AI and Search in Montreal
Tue, 30 Oct 2018
http://www.flax.co.uk/blog/2018/10/30/activate-2018-day-1-ai-and-search-in-montreal/

Activate is the successor to the Lucene/Solr Revolution conference that our partner Lucidworks runs every Autumn, and was held this year in Montreal, Canada. After running a successful Lucene Hackday on the Monday before the conference, we joined hundreds of others to hear Will Hayes, the CEO of Lucidworks, explain the new name and direction of the event – it was nice to hear he agrees with me that search is the key to AI. Yoshua Bengio of local AI laboratory MILA followed Will and described some recent breakthroughs in AI, including speech and image recognition, and went on to talk about Creative AI, which can ‘imagine’ new faces after sufficient training. He listed five necessary ingredients for successful machine learning: lots of data, flexible models, enough compute power, computationally efficient inference and powerful prior assumptions to deflect the ‘curse of dimensionality’. These are hard to get right – he told us how even cutting-edge AI is still far from human-level intelligence but can be used to extend human cognitive power. MILA is the world’s greatest concentration of academics working in deep learning and is heavily funded by the Canadian government.

I was also pleased to notice our Luwak stored search library mentioned in the handout Bloomberg had placed on every seat!

The talks I attended after the keynote were generally focused on open source, Solr or search topics, but the theme of AI was everywhere. The first talk I went to was about Accenture’s Content Analytics Studio – which looks like a useful tool for building search and analytics applications using a library of widgets and a Python code editor. Unfortunately it wasn’t very clear how one might use this platform, with the presenter eventually admitting that it was a proprietary product but not giving any idea of the price or business model. I would much prefer it if presenters were up-front about commercial products, especially as many attendees were from an open source background.

David Smiley’s talk on Querying Hundreds of Fields at Scale was a lot more interesting: he described how Salesforce run millions of Solr cores and index extremely diverse customer data (as each one can customise their field structure). Using the usual Solr qf operator across possibly 150 fields can lead to thousands of subqueries being generated which also need to be run across each segment. His approach to optimising performance included analysing the input data per field type rather than per field, building a custom segment merge policy and encoding the field type as a term suffix in the term dictionary. Although this uses more CPU time, it improves performance by at least a factor of 10. David hopes to contribute some of this work back to Solr as open source, although much is specific to Salesforce’s use case. This was a fascinating talk about some very clever low-level Lucene techniques.


Next was my favourite talk of the conference – Kevin Watters on the Intersection of Robotics, Search & AI, featuring a completely 3D-printed humanoid robot based on the open source InMoov platform and MyRobotLab software. Kevin has used hundreds of open source projects to add capabilities such as speech recognition, question answering (based on Wikipedia), computer vision, deep learning etc. using a pub/sub architecture. The robot’s ‘memory’ – everything it does, sees, hears and how the various modules interact – is stored in a Solr index. Kevin’s engaging talk showed us examples of how the robot’s search engine powered memory can be used for deep learning, for example for image recognition – in his demo it could be trained to recognise pictures of some Solr committers. This really was the crossover between search and AI!

Joel Bernstein then took us through Applied Mathematical Modelling with Apache Solr – describing the ongoing work to integrate the Apache Commons Math library. In particular he showed how these new features can be used for anomaly detection (e.g. an unusually slow network connection) using a simple linear regression model. Solr’s Streaming API can be used to run a constant prediction of the likely response times for sending files of a certain size and any statistically significant differences noted. This is just one example of the powerful features now available for Solr-based analytics – there was more to come in Amrit Sarkar‘s talk afterwards on Building Analytics Applications with Streaming Expressions. Amrit showed a demo (code available here) using Apache Zeppelin where Solr’s various SQL-style operations can be run in parallel for better performance, splitting the job up over a number of worker collections. As the demo imported data directly from a database using a JDBC connector, some of us in the room wondered whether this might be a higher-performing alternative to the venerable (and slow) Data Import Handler…
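As a rough illustration of the kind of expression Joel demonstrated (the collection and field names here are invented, and the exact function set depends on your Solr version), a linear model can be fitted and used for prediction directly in a streaming/math expression:

```
let(a=search(logs, q="*:*", fl="fileSize,responseTime", sort="fileSize asc"),
    x=col(a, fileSize),
    y=col(a, responseTime),
    model=regress(x, y),
    predicted=predict(model, 50000))
```

Comparing `predicted` against the observed response time for a 50,000-byte transfer is the basis of the anomaly check Joel described.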

That was the last talk I saw on Wednesday: that evening was the conference party in a nearby bar, which was a lot of fun (although the massive TV screen showing that night’s hockey game was a little distracting!). I’ll write about day 2 soon: videos of the talks are likely to be available soon on Lucidworks’ Youtube channel and I’ll update this post when they appear.

Lucene Hackdays in London & Montreal
Tue, 23 Oct 2018
http://www.flax.co.uk/blog/2018/10/23/lucene-hackdays-in-london-montreal/

We ran a couple of Lucene Hackdays over the last couple of weeks: a chance to get together with other people working on open source search, learn from each other and to try and improve both Lucene and associated software.

Our first Hackday was in London, hosted by Mimecast at their offices near Moorgate. Despite a fire alarm practice (during which we ended up under some flats at the Barbican, whose residents may have been a little surprised at quite how many people ended up milling around under their balconies) we had a busy day – we split into three groups to look at tools for inspecting Lucene indexes, various outstanding bugs and issues with Lucene and Solr and to review a well-known issue where different Solr replicas can provide slightly different result ordering. By 5.30 p.m. when we were scheduled to finish we were still frantically hacking on some last-minute Javascript to add a feature to our Marple index inspector – luckily a few minutes later to a collective sigh of relief we had it working and we repaired to a local pub for food and drink (kindly sponsored by Elastic).

The next week a number of us were in Montreal for the Activate conference (previously known as Lucene/Solr Revolution but now sprinkled with cutting-edge AI fairy dust!). Our second Hackday was hosted by Netgovern and we worked on various Lucene/Solr issues, some improvements to our Harahachibu proxy (which attempts to block Solr updates when disk space is low) and discussed in depth how to improve the Solr onboarding experience. Pizza (sponsored by OneMoreCloud) and coffee fuelled the hacking and we also added some new features including a Query Parser for MinHash queries. Many Lucene/Solr committers attended and afterwards we met up for a drink & food nearby (thanks to Searchstax for sponsoring this!) where we were joined by a few others – including Yonik Seeley, creator of Solr.

Next it was time for Activate – of which more later! Thanks to everyone who attended – you can see some notes and links about what we worked on here. Work will be continuing on these issues I’m sure.

Three weeks of search events this October from Flax
Tue, 04 Sep 2018
http://www.flax.co.uk/blog/2018/09/04/three-weeks-of-search-events-this-october-from-flax/

Flax has always been very active at conferences and events – we enjoy meeting people to talk about search! With much of our consultancy work being carried out remotely these days, attending events is a great way to catch up in person with our clients, colleagues and peers and to learn from others about what works (and what doesn’t) when building cutting-edge search solutions. I’m thus very glad to announce that we’re running three search events this coming October.

Earlier in the year I attended Haystack in Charlottesville, one of my favourite search conferences ever – and almost immediately began to think about whether we could run a similar event here in Europe. Although we’ve only had a few months I’m very happy to say we’ve managed to pull together a high-quality programme of talks for our first Haystack Europe event, to be held in London on October 2nd. The event is focused on search relevance from both a business and a technical perspective and we have speakers from global retailers as well as specialist consultants and authors. Tickets are already selling well and we have limited space, so I would encourage you to register as soon as you can (Haystack USA sold out even after the capacity was increased). We’re running the event in partnership with Open Source Connections.

The next week we’re running a Lucene Hackday on October 9th as part of our London Lucene/Solr Meetup programme. Building on previous successful events, this is a day of hacking on the Apache Lucene search engine and associated software such as Apache Solr and Elasticsearch. You can read up on what we achieved at our last event a couple of years ago – again, space is limited, so sign up soon to this free event (huge thanks to Mimecast for providing the venue and to Elastic for sponsoring drinks and food for an evening get-together afterwards). Bring a laptop and your ideas (and do comment on the event page if you have any suggestions for what we should work on).

We’ll be flying to Montreal soon afterwards to attend the Activate conference (run by our partners Lucidworks) and while we’re there we’ll host another free Lucene Hackday on October 15th. Again, this would not be possible without sponsorship and so thanks must go to Netgovern, SearchStax and One More Cloud. Remember to tell us your ideas in the comments.

So that’s three weeks of excellent search events – see you there!

Defining relevance engineering, part 1: the background
Mon, 25 Jun 2018
http://www.flax.co.uk/blog/2018/06/25/defining-relevance-engineering-part-1-the-background/

Relevance Engineering is a relatively new concept but companies such as Flax and our partners Open Source Connections have been carrying out relevance engineering for many years. So what is a relevance engineer and what do they do? In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

Let’s start by turning the clock back a few years. Ten or fifteen years ago search engines were usually closed source, mysterious black boxes, costing five- or six-figure sums for even relatively modest installations (let’s say a couple of million documents – small by today’s standards). Huge amounts of custom code were necessary to integrate them with other systems and projects would take many months to demonstrate even basic search functionality. The trick was to get search working at all, even if the eventual results weren’t very relevant. Sadly even this was sometimes difficult to achieve.

Nowadays, search technology has become highly commoditized and many developers can build a functioning index of several million documents in a couple of days with off-the-shelf, open source, freely available software. Even the commercial search firms are using open source cores – after all, what’s the point of developing them from scratch? Relevance is often ‘good enough’ out of the box for non business-critical applications.

A relevance engineer is required when things get a little more complicated and/or when good search is absolutely critical to your business. If you’re trading online, search can be a major driver of revenue and getting it wrong could cost you millions. If you’re worried about complying with the GDPR, MiFID or other regulations then ‘good enough’ simply isn’t if you want to prevent legal issues. If you’re serious about saving the time and money your employees waste looking for information or improving your business’ ability to thrive in a changing world then you need to do search right.

So what search engine should you choose before you find a relevance engineer to help with it? I’m going to go out on a limb here and say it doesn’t actually matter that much. At Flax we’re proponents of open source engines such as Apache Lucene/Solr and Elasticsearch (which have much to recommend them) but the plain fact is that most search engines are the same under the hood. They all use the same basic principles of information retrieval; they all build indexes of some kind; they all have to analyze the source data and user queries in much the same way (ignore ‘cognitive search’ and other ‘AI’ buzzwords for now, most of this is marketing rather than actual substance). If you’re using Microsoft Sharepoint across your business we’re not going to waste your time trying to convince you to move wholesale to a Linux-based open source alternative.
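Those shared principles are easy to demonstrate: at its heart every engine builds an inverted index mapping terms to the documents containing them. A toy sketch in Python (deliberately ignoring analysis, ranking and everything else that makes real engines hard):

```python
from collections import defaultdict

# A toy inverted index - the core data structure every search engine,
# open source or commercial, builds in some form.
class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        # Naive analysis: lowercase and split on whitespace
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: return docs containing every query term
        sets = [self.postings.get(t.lower(), set()) for t in query.split()]
        return set.intersection(*sets) if sets else set()

idx = TinyIndex()
idx.add(1, "open source search engines")
idx.add(2, "commercial search engines")
idx.add(3, "open data")

print(sorted(idx.search("open search")))  # -> [1]
```

How engines compress, distribute and rank over this structure is where they differ – and exactly where a relevance engineer earns their keep.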

Any modern search engine should allow you the flexibility to adjust how data is ingested, how it is indexed, how queries are processed and how ranking is done. These are the technical tools that the relevance engineer can use to improve search quality. However, relevance engineering is never simply a technical task – in fact, without a business justification, adjusting these levers may make things worse rather than better.

In the next post I’ll cover how a relevance engineer can engage with a business to discover the why of relevance tuning. In the meantime you can read Doug Turnbull’s chapter in the free Search Insights 2018 report by the Search Network (the rest of the report is also very useful) and you might also be interested in the ‘Think like a relevance engineer’ training he is running soon in the USA. Of course, feel free to contact us for details of similar UK or EU-based training or if you need help with relevance engineering.

London Lucene/Solr Meetup – Relevance tuning for Elsevier’s Datasearch & harvesting data from PDFs
Thu, 03 May 2018
http://www.flax.co.uk/blog/2018/05/03/london-lucene-solr-meetup-elseviers-datasearch-harvesting-data-from-pdfs/

Elsevier were our kind hosts for the latest London Lucene/Solr Meetup and also provided the first speaker, Peter Cotroneo. Peter spoke about their DataSearch project, a search engine for scientific data. After describing how most other data search engines only index and rank results using metadata, Peter showed how Elsevier’s product indexes the data itself and also provides detailed previews. DataSearch uses Apache NiFi to connect to the source repositories, Amazon S3 for asset storage, Apache Spark to pre-process the data and Apache Solr for search. This is a huge project with many millions of items indexed.

Relevance is a major concern for this kind of system and Elsevier have developed many strategies for relevance tuning. Features such as highlighting and auto-suggest are used, lemmatisation rather than stemming (with scientific data, stemming can cause issues such as turning ‘Age’ into ‘Ag’ – the chemical symbol for silver) and a custom rescoring algorithm that can be used to promote up to 3 data results to the top of the list if deemed particularly relevant. Elsevier use both search logs and test queries generated by subject matter experts to feed into a custom-built judgement tool – which they are hoping to open source at some point (this would be a great complement to Quepid for test-based relevance tuning).

Peter also described a strategy for automatic optimization of the many query parameters available in Solr, using machine learning, based on some ideas first proposed by Simon Hughes of dice.com. Elsevier have also developed a Phrase Service API, which helps improve phrase based search over the standard un-ordered ‘bag of words’ model by recognising acronyms, chemical formulae, species, geolocations and more, expanding the original phrase based on these terms and then boosting them using Solr’s query parameters. He also mentioned a ‘push API’ available for data providers to push data directly into DataSearch. This was a necessarily brief dive into what is obviously a highly complex and powerful search engine built by Elsevier using many cutting-edge ideas.
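Elsevier’s Phrase Service itself is custom, but the final boosting step it feeds can be pictured with standard edismax parameters – something like the following sketch (field names and weights invented for illustration):

```
q=heart attack
defType=edismax
qf=title^2 body
pf=title^10 body^5                       # boost documents matching the full phrase
bq=concepts:"myocardial infarction"^8    # boost an expanded/recognised concept phrase
```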

Our next speaker, Michael Hardwick of Elite Software, talked about how textual data is stored in PDF files and the implications for extracting this data for search applications. In an engaging (and at some times slightly horrifying) talk he showed how PDFs effectively contain instructions for ‘painting’ characters onto the page and how certain essential text items such as spaces may not be stored at all. He demonstrated how fonts are stored within the PDF itself, how character encodings may be deliberately incorrect to prevent copy-and-paste operations and in general how very little if any semantic information is available. Using newspaper content as an example he showed how reading order is often difficult to extract as the PDF layout is a combination of the text from the original author and how it has been laid out on the page by an editor – so the headline may have been added after the article text, which itself may have been split up into sections.
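To make this concrete, here is a tiny hand-written PDF content-stream fragment of the sort Michael was describing. Note that the gap between the two words exists only as a kerning adjustment (-250), not as a space character – which is why naive text extraction can produce ‘Helloworld’:

```
BT                            % begin text object
  /F1 12 Tf                   % select font F1 at 12pt
  72 720 Td                   % move the text cursor to (72, 720)
  [(Hello) -250 (world)] TJ   % paint glyphs; the number adjusts spacing
ET                            % end text object
```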

Tables in PDFs were described as a particular issue when attempting to extract numerical data for re-use – the data may not be stored in the order in which it appears on the page, for example if only part of a table is updated each week in a regular publication. With PDF files sometimes compressed and encrypted the task of data extraction can become even more difficult. Michael laid out the choices available to those wanting to extract data: optical character recognition, a potentially very expensive Adobe API (that only gives the same quality of output as copy-and-paste), custom code as developed by his company and finally manual retyping, the latter being surprisingly common.

Thanks to both our speakers and our hosts Elsevier – we’re planning another Meetup soon, hopefully in mid to late June.

Haystack, the relevance conference – birth of a new profession?
Mon, 16 Apr 2018
http://www.flax.co.uk/blog/2018/04/16/birth-new-profession-haystack-relevance-conference/

I’ve just returned from Charlottesville, Virginia and the Haystack search relevance conference hosted by our partners Open Source Connections. The venues were their own office and the Random Row brewery next door – added once they realised that the event had grown from its humble beginnings as a small, informal gathering of maybe 50 people into a professional conference for over twice that number, with attendees from as far afield as the west coast of the US, Poland and of course the UK. I’ll be writing up each day of the event and what I learned from the talks in blogs to follow, but wanted to start with my overall impressions.

I don’t think I’ve been to any other conference with such a strong sense of community or such a high quality of presentations. It was particularly refreshing to be among a group of people with such a level of search expertise and experience that at no point did anything have to be ‘dumbed down’ or over-explained. The attendee list included open source committers from projects including Apache Lucene/Solr and Apache Tika, experts in commercial search, authors of books I’ve long regarded as essential for anyone working in this field, independent consultants and those working for huge global companies. The talks were well programmed, ran exactly to schedule and covered cutting-edge topics. Between these talks the networking was relaxed and friendly and I had a chance to get to know several people in real life that I’ve previously only connected with online.

I think this conference may also have signalled the birth of a new profession of “relevance engineer” – someone who can understand both the business and technical aspects of search relevance, work with a variety of underlying search engines and expertly use the correct tools for the job to drive a continuing process of search quality improvement. Personally, I learnt a huge amount of useful information, made connections with many others in our field and have pages of notes to follow up on.

Last but by no means least, I must extend my personal thanks to all at OSC who created, planned and ran the event – as a veteran of many events in both technical and non-technical fields I understand very well how much work goes into them, especially if you’re not an event planner by profession! You opened your doors to us and made us all feel very welcome and you all worked extremely hard to make this one of the best conferences I’ve ever attended.

More to follow on day 1 and day 2 soon.

The post Haystack, the relevance conference – birth of a new profession? appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2018/04/16/birth-new-profession-haystack-relevance-conference/feed/ 0
When even the commercial vendors are using it, has open source search won? http://www.flax.co.uk/blog/2018/03/15/even-commercial-vendors-using-open-source-search-won/ http://www.flax.co.uk/blog/2018/03/15/even-commercial-vendors-using-open-source-search-won/#respond Thu, 15 Mar 2018 12:03:32 +0000 http://www.flax.co.uk/?p=3718 There have been some interesting announcements recently which may point to an increasing realisation amongst commercial search firms that an open source model is an essential advantage in today’s search market. Coveo have announced that their enterprise search engine can … More

The post When even the commercial vendors are using it, has open source search won? appeared first on Flax.

]]>
There have been some interesting announcements recently which may point to an increasing realisation amongst commercial search firms that an open source model is an essential advantage in today’s search market. Coveo have announced that their enterprise search engine can run on an Elasticsearch core – an interesting move for a previously firmly closed-source company. BA Insight, who have long provided extensions and enhancements for Microsoft’s decidedly closed-source SharePoint search facility, have been offering Elasticsearch as a core search engine for quite a while. It is also an open secret that some other commercial search firms (such as Attivio) use Apache Lucene as a core technology.

The commercial search firms will have noticed that Lucidworks (who employ a large proportion of Lucene/Solr committers) have announced Lucidworks Fusion 4, which can be used for site and enterprise search. Elastic, the company behind Elasticsearch, recently acquired Swiftype and have repositioned it as a packaged site search engine (with an enterprise search version in beta and rumoured to appear later this year). Both Lucidworks and Elastic are thus attempting to capture a larger segment of the search market, using their dominance and expertise in the open source world. Note however that all these products are ‘open core’ rather than ‘open source’ (despite Elastic’s attempts to pretend otherwise) – which is not very different from Coveo or BA Insight’s approach – so the distance between the traditionally separate ‘open source’ and ‘closed source’ search vendors is now closing.

The question for any search vendor should be whether there is any point developing and maintaining a closed source search engine core, when Lucene derivatives such as Solr and Elasticsearch are so well established. The race between closed and open source is perhaps over.

Here at Flax we’ve been building open source search engines since 2001 and we’re independent of any vendor – so if you need help with your search project, do let us know.

Note: Enterprise Search is usually defined as a search engine working behind a corporate firewall, indexing different content sources such as flat files, databases and intranets. Site Search is usually visible to non-employees and only indexes websites. However, when site search includes an intranet the boundary becomes a little fuzzy – is this lightweight enterprise search? In most cases this doesn’t hugely matter – the underlying search engine core will be the same, it’s simply a difference in where source data comes from and how it is presented to users. However, these two options are often presented as different products by vendors.

UPDATE: A few days after I posted this blog, commercial vendor Attivio released SUIT, an open source user interface library that can run on their own engine, Elasticsearch or Solr. It seems the trend continues.


]]>
http://www.flax.co.uk/blog/2018/03/15/even-commercial-vendors-using-open-source-search-won/feed/ 0
London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco http://www.flax.co.uk/blog/2018/02/08/london-lucene-solr-meetup-java-9-1-beeelion-documents-alfresco/ http://www.flax.co.uk/blog/2018/02/08/london-lucene-solr-meetup-java-9-1-beeelion-documents-alfresco/#respond Thu, 08 Feb 2018 14:55:22 +0000 http://www.flax.co.uk/?p=3688 This time Pivotal were our kind hosts for the London Lucene/Solr Meetup, providing a range of goodies including some frankly enormous pizzas – thanks Costas and colleagues, we couldn’t have done it without you! Our first talk was from Uwe … More

The post London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco appeared first on Flax.

]]>
This time Pivotal were our kind hosts for the London Lucene/Solr Meetup, providing a range of goodies including some frankly enormous pizzas – thanks Costas and colleagues, we couldn’t have done it without you!

Our first talk was from Uwe Schindler, Lucene committer, who started with some history of how earlier Java 7 releases had broken Apache Lucene in somewhat spectacular fashion. After this incident the Oracle JDK team and the Lucene PMC worked closely together to improve both communications and testing – with regular builds of Java 8 (using Jenkins) being released to test with Lucene. The Oracle team later publicly thanked the Lucene committers for their help in finding Java issues. Uwe told us how Java 9 introduced a module system named ‘Jigsaw’ which tidied up various inconsistencies in how Java keeps certain APIs private (but not actually private) – this caused some problems with Solr. Uwe also mentioned how Lucene’s MMapDirectory feature should be used on 64-bit platforms (there’s a lot more detail on his blog) and various intrinsic bounds-checking features which can be used to simplify Lucene code. The three main advantages of Java 9 he mentioned were lower garbage collection times (with the new G1GC collector), more security features and, in some cases, better query performance. Going forward, Uwe is already looking at Java 10 and future versions and how they impact Lucene – but for now he’s been kind enough to share his slides from the Meetup.
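MMapDirectory itself is a Lucene class, but the mechanism it relies on is plain OS-level memory mapping. The stand-alone sketch below (my own illustration, using only java.nio and no Lucene classes, with a made-up temporary file standing in for an index segment) shows the same kind of memory-mapped read that a 64-bit JVM performs when Lucene opens index files this way:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // A small temporary file standing in for a Lucene segment file.
        Path tmp = Files.createTempFile("segment", ".dat");
        Files.write(tmp, "hello lucene".getBytes(StandardCharsets.UTF_8));

        // Map the file into virtual memory, much as MMapDirectory does for
        // index files: the OS pages data in on demand rather than copying it
        // through a Java heap buffer.
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] contents = new byte[(int) channel.size()];
            buffer.get(contents);
            System.out.println(new String(contents, StandardCharsets.UTF_8)); // prints "hello lucene"
        }
        Files.deleteIfExists(tmp);
    }
}
```

The 64-bit recommendation follows directly from this design: mapping a large index needs plenty of virtual address space, which 32-bit JVMs simply don’t have.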

Our second speaker was Andy Hind, head of search at Alfresco. His presentation included the obligatory Austin Powers references, of course! He described the architecture Alfresco use for search (a recent blog post also shows this – interestingly, although Solr is used, ZooKeeper is not: Alfresco uses its own method to handle many Solr servers in a cluster). The test system described ran on the Amazon EC2 cloud with 10 Alfresco nodes and 20 Solr nodes and indexed around 1.168 billion items. The source data was synthetically generated to simulate real-world conditions with a certain amount of structure, which allowed queries to be built to hit particular areas of the data. The test simulated 5,000 users, of whom around 500 were assumed to be active concurrently. The system managed to index the content in around 5 days at a speed of around 1,000 documents a second, which is impressive.
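Taking the headline figures at face value, a quick back-of-the-envelope check is worthwhile (my arithmetic, not Alfresco’s – both the five-day duration and the per-second rate are approximations from the talk and may measure slightly different things, such as a per-node rather than aggregate rate):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        long docs = 1_168_000_000L;        // items indexed in the Alfresco test
        long seconds = 5L * 24 * 60 * 60;  // ~5 days of wall-clock indexing time
        long aggregate = docs / seconds;   // average docs/second across the whole cluster
        long perNode = aggregate / 20;     // spread over the 20 Solr nodes
        System.out.println(aggregate + " docs/s aggregate, " + perNode + " docs/s per Solr node");
    }
}
```

This works out to roughly 2,700 documents a second across the cluster, or about 135 per Solr node – impressive sustained throughput for content of this scale.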

Thanks to both our speakers and we’ll return soon – if you have a talk for our group (or can host a Meetup) do please get in touch.


]]>
http://www.flax.co.uk/blog/2018/02/08/london-lucene-solr-meetup-java-9-1-beeelion-documents-alfresco/feed/ 0