Posts Tagged ‘lucidworks’

As Hadoop gains, does Lucene benefit?

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in an HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There have also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!

The closed-source topping on the open-source Elasticsearch

Today Elasticsearch (the company, not the software) announced their first commercial, closed-source product, a monitoring plugin for Elasticsearch (the software, not the company – yes I know this is confusing, one might suspect deliberately so). Amongst the raft of press releases there are a few small liberties with the truth, for example describing Elasticsearch (the company) as ‘founded in 2012 by the people behind the Elasticsearch and Apache Lucene open source projects’ – surely the latter project was started by Doug Cutting, who isn’t part of the aforementioned company.

Adding some closed-source dusting to a popular open-source distribution is nothing new of course – many companies do it, especially those that are venture funded – it’s a way of building intellectual property while also taking full advantage of the open-source model in terms of user adoption. Other strategies include curated distributions, such as that offered by Heliosearch (founded by Solr creator Yonik Seeley), and our partner LucidWorks’ complete packaged search applications. It can help lock potential clients into your version of the software and your vision of the future, although of course they are still free to download the core and go it alone (or engage people like us to help do so), which helps them retain some control.

It’s going to be interesting to see how this strategy develops for Elasticsearch (for the last time, the company). At Flax we’ve also built various additional software components for search applications – but as we have no external investors to please, these are freely available as open-source software, including Luwak, our fast stored query engine, Clade, a taxonomy/classification prototype, and even some file format extractors.

Time for the crystal ball again…

It’s always fun to make predictions about the future, especially as one can be pretty sure to be proved wrong in interesting ways. At the start of 2014 we at Flax are looking forward to another year of building open source search and we already have some great client projects in progress that we’ll shortly be able to talk about, but what else might be happening this year? Here are some points to note:

  • The Elasticsearch project continues to add features at a prodigious rate during the arms race between it and Apache Solr – this battle can only be good news for end users in our view. We can expect a 1.0 release of Elasticsearch this year and several further major 4.x releases of Solr.
  • The Solr world has become slightly more complex as original author Yonik Seeley has left Lucidworks to start his own company, Heliosearch – with its own packaged distribution of Solr. How will Heliosearch contribute to the Solr ecosystem?
  • HP Autonomy is a sponsor of the Enterprise Search Europe conference this year, although there’s still some fallout from HP’s acquisition of Autonomy, and little news from the various official investigations into this process. Perhaps this year HP’s overall strategy will become a little clearer.
  • The Big Data bandwagon rolls on and more or less every search company now stresses its capabilities in this area for marketing purposes: but how big is Big? It’s not enough just to re-quote IDC’s latest study on how many exabytes everyone is producing these days. The value is in the detail, not the sheer volume: good (and deep) analytics is the key.
  • We think there might be some interesting things happening around open source search and bioinformatics soon – watch this space!


Posted in News

January 7th, 2014


Lucene Revolution 2013, Dublin: day 1

Four of the Flax team are in Dublin this week for Lucene Revolution, almost certainly the largest event centred on open source search and specifically Lucene. There are probably a couple of hundred Lucene enthusiasts here and the event is being held at the Aviva Stadium on Lansdowne Road: look out the windows and you can see the pitch! Here are some personal reflections: a number of the talks I attended today have a connection to our own work in media monitoring, which we’re talking about tomorrow.

Doug Turnbull’s Test Driven Relevancy was interesting, discussing OSC’s Quepid tool that allows content owners and search experts to work together to tweak and tune Solr’s options to present the right results for a query. I wondered whether this tool might eventually be used to develop a Learning to Rank option for Solr, as Lucene 4 now supports a pluggable scoring model.

I enjoyed Real-Time Inverted Search in the Cloud Using Lucene and Storm, during which Joshua Conlin told us about running hundreds of thousands of stored queries in a distributed architecture. Storm in particular sounds worth investigating further. There is currently no attempt to reduce or ‘prune’ the set of queries before applying them: Joshua quoted speeds of 4000 queries/sec across their cluster of 8 instances. These are impressive numbers, but our own monitoring applications are working at 20 times that speed by working out which queries not to apply.
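To illustrate what ‘working out which queries not to apply’ can mean, here is a minimal sketch in Python. It is not how Luwak or the system described in the talk works; the `StoredQueryIndex` class is hypothetical, and it assumes every stored query is a simple list of terms that must all be present. Each query is filed under just one of its required terms, so a document only ever triggers the queries whose filing term it contains.

```python
from collections import defaultdict


class StoredQueryIndex:
    """Hypothetical index of stored queries (assumption: all terms are required)."""

    def __init__(self):
        self._by_term = defaultdict(set)  # filing term -> ids of queries filed under it
        self._queries = {}                # query id -> set of required terms

    def add(self, query_id, required_terms):
        """Store a query; required_terms must be non-empty."""
        terms = {t.lower() for t in required_terms}
        self._queries[query_id] = terms
        # File the query under one required term only. If that term is absent
        # from a document the query cannot match, so it is never examined.
        self._by_term[min(terms)].add(query_id)

    def candidates(self, text):
        """Return ids of queries that might match, skipping all the rest."""
        found = set()
        for term in set(text.lower().split()):
            found |= self._by_term.get(term, set())
        return found

    def match(self, text):
        """Fully check only the candidate queries against the document."""
        doc_terms = set(text.lower().split())
        return sorted(q for q in self.candidates(text)
                      if self._queries[q] <= doc_terms)


index = StoredQueryIndex()
index.add("q1", ["solr", "hadoop"])
index.add("q2", ["lucene"])
index.add("q3", ["elasticsearch", "plugin"])

doc = "Cloudera runs Solr on Hadoop"
print(index.candidates(doc))  # only the queries worth running
print(index.match(doc))       # the queries that actually match
```

With hundreds of thousands of stored queries, most documents share a filing term with only a small fraction of them, which is where the saving comes from. A real system would also need to handle phrases, wildcards and boolean logic.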

I broke out at this point to catch up with some contacts, including the redoubtable Iain Fletcher of Search Technologies – always a pleasure. After a sandwich lunch I went along to hear Andrzej Bialecki of Lucidworks talk about Sidecar Indexes, a method for allowing rapid updates to Lucene fields. This reminded me of our own experiments in this area using Lucene’s pluggable codecs.

Next was more from the Opensource Connections team, as John Berryman talked about their work to update a patent search application that uses a very old search syntax, BRS. This sounds very much like the work we’ve done to translate one search engine syntax into another for various media monitoring companies – so far we can handle dtSearch and we’re currently finishing off support for HP/Autonomy Verity’s VQL (PDF).

This latter issue has got me thinking that perhaps it might be possible to collaboratively develop an open source search engine query language – various parsers could be developed to turn other search syntaxes into this language, and search engines like Lucene (or anything else) could then be extended to implement support for it. This would potentially allow much easier migration between search engine technologies. I’m discussing the concept with various folks at the event this week so do please get in touch if you are interested!
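As a rough sketch of the idea (not an existing project), a shared query language could be as simple as a small tree of node types that parsers for other syntaxes produce and that each engine renders in its own form. The class names below are hypothetical, and the renderer targets Lucene’s classic query parser syntax without escaping special characters.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    text: str


@dataclass(frozen=True)
class And:
    clauses: Tuple["Node", ...]


@dataclass(frozen=True)
class Or:
    clauses: Tuple["Node", ...]


@dataclass(frozen=True)
class Not:
    clause: "Node"


# The hypothetical intermediate language: any parser would emit these nodes.
Node = Union[Term, And, Or, Not]


def to_lucene(node):
    """Render a query tree as a Lucene classic query parser string."""
    if isinstance(node, Term):
        # Multi-word terms become quoted phrases.
        return f'"{node.text}"' if " " in node.text else node.text
    if isinstance(node, And):
        return "(" + " AND ".join(to_lucene(c) for c in node.clauses) + ")"
    if isinstance(node, Or):
        return "(" + " OR ".join(to_lucene(c) for c in node.clauses) + ")"
    if isinstance(node, Not):
        return "NOT " + to_lucene(node.clause)
    raise TypeError(f"unsupported node: {node!r}")


query = And((
    Term("solr"),
    Or((Term("hadoop"), Term("big data"))),
    Not(Term("autonomy")),
))
print(to_lucene(query))
```

A parser for dtSearch, VQL or BRS would build the same tree, and a second renderer could target a different engine, so each syntax and each engine only needs to be handled once.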

Back tomorrow with a further update on this exciting conference – tonight we’re all off to the Temple Bar area of Dublin for food and drink, generously provided by Lucidworks who should also be thanked for organising the Revolution.


Posted in Technical, events

November 6th, 2013


Finding the elephant in the room: open source search & Hadoop grow closer together

I’ve been lucky enough to attend two talks on Hadoop in the last few weeks which has made me take a closer look at this technology. In case you didn’t know, Hadoop is an Apache top level open source project comprising a framework for distributed computing and storage, originally created by Doug Cutting (also the creator of Apache Lucene) while at Yahoo! in 2005. Distributed computing is carried out using MapReduce (roughly speaking, the ‘map’ bit involves splitting a processing task up into chunks and distributing these among various processing nodes, the ‘reduce’ bit brings all the results together again) and the storage uses the Hadoop Distributed File System (HDFS). There are other parts of Hadoop including a database (HBase), data warehouse with SQL-like language (Hive), scripting language (Pig) and more.
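The ‘map’ and ‘reduce’ steps can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not Hadoop’s API: the function names are made up for the example, and a thread pool stands in for the processing nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map: turn one chunk of lines into (word, 1) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped):
    """Group the values emitted by all mappers under their key."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce: bring the results together as one total per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, workers=2):
    # Split the task into chunks, one per worker (the stand-in for a node).
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return reduce_counts(shuffle(mapped))


print(word_count(["Hadoop and Lucene", "Lucene and Solr"]))
```

In Hadoop the chunks would be blocks of files stored on HDFS, and the framework itself would handle the distribution, grouping and recovery from failed nodes.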

Those I’ve spoken to who have attempted to build applications on Hadoop have said that it’s very much a kit of parts rather than an integrated platform, so not that easy to get started with – which has led to the emergence of various vendors providing ‘curated’ distributions and support, much as Lucidworks does for Apache Lucene/Solr. Cloudera, Hortonworks, and MapR are just some of the best-known of these vendors. With everyone jumping on the BigData bandwagon these days some of these vendors have attracted significant interest and funding.

As you might expect full-text search is often required for these distributed systems and there have been various attempts to bring Hadoop and search closer together. Hortonworks support integration with Elasticsearch, although this currently appears to mean that you can use Hive or Pig to move data from Hadoop on or off a separate Elasticsearch cluster, rather than the search engine running on the cluster itself. Cloudera’s integration of Hadoop with Solr appears to be tighter, with Solr storing its indexes on HDFS directly (perhaps not surprising considering Lucene/Solr committer Mark Miller, who is responsible for most recent SolrCloud development, works for Cloudera). Cloudera even has its own data conditioning framework Flume (yes, it seems we need yet another data conditioning/pipelining solution!) and allows for distributed indexing. MapR have partnered with LucidWorks and integrated LucidWorks Search into their distribution. All these vendors are heavy contributors to Hadoop of course and most also contribute to Lucene/Solr or Elasticsearch.

Since Hadoop has been linked with search from the beginning one can hope that these integration efforts will continue – applications that require distributed search are becoming increasingly common and Hadoop, despite its nature as a kit of parts requiring assembly, is a good foundation to build on.

Meetups, genomes and hack days: Grant Ingersoll visits the UK

Lucene/Solr committer, Mahout co-creator, LucidWorks co-founder and general all-round search expert Grant Ingersoll visited us last week on his way to the SIGIR conference in Dublin. We visited the European Bioinformatics Institute on the Wellcome Trust Genome Campus to hear about some fascinating projects using Lucene/Solr to index genomes, phenomes and proteins and for Grant to give a talk on recent developments in both Lucene/Solr and Mahout – it was gratifying that over 50 people turned up to listen and at least 30 of these indicated they were using the technology.

After a brief rest it was then time to travel to London so Grant could talk at the Enterprise Search London Meetup on both recent developments in Lucene/Solr and what he dubbed ‘Search engine (ab)use’ – some crazy use cases of Lucene/Solr including for very fast key/value storage. He shared some great statistics, including how Twitter make new tweets searchable in around 50 microseconds using only 8-10 indexing servers.

Next it was back to Cambridge for our own Lucene/Solr hack day in a great new co-working space. Attendees ranged from those who had never used Lucene/Solr to those with significant search expertise, and some had come from as far away as Germany – after a brief introduction we split into several groups each mentored by a member of the Flax team. Two groups (one comprised entirely of those who had never used Lucene) worked on a dataset of tweets from UK members of parliament and a healthy sense of competition developed between them – you can see some of the code they developed in our GitHub account, including an entity extractor webservice. Another group, led by Grant, created a SolrCloud cluster, with around 1-2 million documents split into 2 shards – running on ten laptops over a wireless connection! Impressively this was set up in less than ten minutes. Others worked on their own applications including an index of proteins and there was even some work on the Lucene/Solr code itself.

We’re hoping to put the results of some of these projects live very soon, so you can see just what can be built in a single day using this powerful open source software. Thanks to all who came, our hosts at Cambridge Business Lounge and of course Grant for his considerable energy and invaluable expertise. If nothing else, we’ve introduced a lot more people to open source search and sparked some ideas, and we ended off the week with beer in a sunny pub garden which is always nice!

Search events for 2013

Here’s a quick roundup of search-related events coming soon:

Next week Lucene/Solr Revolution is to be held in San Diego, with a couple of days of training on April 29th & 30th and the main event on the 1st and 2nd May. This is probably the biggest event dedicated to Apache Lucene/Solr and features a huge array of presentations from Etsy, Wells Fargo, Lucidworks and even Microsoft who are increasingly supporting open source technologies.

Enterprise Search Europe is next on 15th and 16th May with a day of workshops on the 14th, including one from the Flax team. I’m looking forward to the various open source panels and presentations of course, and hearing from people from Ernst & Young, Nielsen Norman Group, Oracle and the University of Manchester. We’re also running a Meetup event on the first evening, open to all, with the usual informal mix of beer, snacks and search!

Some of the Flax team are hoping to attend Berlin Buzzwords on June 3rd & 4th – this conference promises to address “search”, “store” and “scale” – certainly sounds interesting! We know there will be lots of talks on elasticsearch and Lucene/Solr.

There’s more to come in the Autumn of course – more details when we know them. Hope to meet you at one of these great events!

Strange bedfellows? The rise of cloud based search

Last night our US partners Lucid Imagination announced that LucidWorks, their packaged and supported version of Apache Lucene/Solr, is available on Microsoft’s Azure cloud computing service. It seems like only a few weeks since Amazon announced their own CloudSearch system and no doubt other ‘search as a service’ providers are waiting in the wings (we’re going to need a new acronym as SaaS is already taken!). At first the combination of a search platform based on open source Java code with Microsoft hosting might seem strange, and it raises some interesting questions about the future of Microsoft’s own FAST Search technology – is this final proof that FAST will only ever be part of SharePoint and never a standalone product? However, with search technology becoming more and more of a commodity, this is a great option for customers looking for search over relatively small numbers of documents.

Lucid’s offering is considerably more flexible and full-featured than Amazon’s, which we hear is pretty basic with a lack of standard search features like contextual snippets and a number of bugs in the client software. You can see the latter in action at Runar Buvik’s excellent OpenTestSearch website. With prices for the Lucid service starting at free for small indexes, this is certainly an option worth considering.

Enterprise Search Europe 2012 – Big Data, search surveys and some FUD from Google

I visited Enterprise Search Europe for the first day only last week, and caught a number of the presentations as well as giving one of my own (which I won’t discuss here but you’ll hear more about over the next few weeks). First up was Paul Doscher of Lucid Imagination with a lively presentation discussing whether search is either dead or now a commodity, or whether search on Hadoop is the new killer app for the emerging world of Big Data. We then had Kristian Norling from Findwise with some initial results from their survey on enterprise search – some interesting numbers here such as ‘18.5% of users are mostly/very satisfied with search’ and only ‘6% have a search strategy although 46% are planning one’ – we hear that Kristian is hoping to make the survey an annual one, which will be a great resource for anyone in the industry.

Matt Mullen, fuelled by diet cola, gave an introduction to search with a key point – that enterprise search usually performs a role within a workflow or task – a fact often ignored. Runar Buvik of Searchdaimon talked about a great resource he has developed comparing search engines, which can give some often amusing contrasts between different technologies, with some insisting there are no results for a particular query while others find thousands. I also enjoyed Emma Bayne and Donald Phillips’ polished presentation on the search facilities at the National Archives – interestingly, although Autonomy is currently powering their search, they are considering open source alternatives.

The day concluded with a presentation from Matt Eichner of Google, who turned up with their own film crew. You can read much of what he said at Computer World. I’m afraid I didn’t enjoy this presentation very much – it talked down to the audience and contained a lot of FUD around open source (surprising when Google uses and supports so much of it) – complete with sympathy-garnering pictures of babies in incubators and silly analogies about how one should prefer to fly in the airplane that cost the most. I hadn’t realised until his talk that the Google Search Appliance appears to be made of cheese!

It was great to network and catch up, and I hope next year to be able to attend the whole event. Thanks to all the organisers, especially Martin White of Intranet Focus.

Amazon CloudSearch – a game changer?

Amazon have just launched a cloud-based search service, which promises a ‘fully managed search service in the cloud’ – and it certainly looks impressive, with auto-scaling built in. You simply create a service, upload documents as JSON or XML and then perform searches. For cases where you need to search publicly available data this offers a great way to avoid having to install and integrate any search software – of course it won’t be so popular if you’re worried about where your data actually is, or other complications such as the Patriot Act.

As you might expect, some people are already offering services based around CloudSearch (we’d be happy to do the same - just ask!) and there’s a demo of searching Wikipedia available. I’m not sure who SmackBot is but I’m slightly wary of reading any Wikipedia articles it’s had something to do with…

Of course searching Wikipedia is nothing new and I sometimes wish for a different choice of source material for search demos.

One thing that seems clear is that with the rise of cloud-based search options (here’s another from our partners Lucid Imagination, based on Apache Lucene/Solr) the cost and complication of ‘simple’ search projects could fall dramatically, applying further pressure to those companies selling closed source search engines for frankly unrealistic prices. Amazon’s offering, with their huge experience in cloud-based services, has the potential to be a game changer for the search market.


Posted in News, Technical

April 12th, 2012
