Posts Tagged ‘big data’

How not to predict the future of search

I’ve just seen an article titled Enterprise Search: 14 Industry Experts Predict the Future of Search which presents a list of somewhat contradictory opinions. I’m afraid I have some serious issues with the experts chosen and the undeniably blinkered views some of them have presented.

Firstly, if you’re going to ask a set of experts to write about Enterprise Search, don’t choose an expert in SEO as part of your list. SEO is not Enterprise Search; in fact a lot of the time it isn’t anything at all (except snake oil) – it’s a way of attempting to game the algorithms of web search engines. Secondly, at least make some attempt to prevent your experts from just listing the capabilities of their own companies in their answers: in fact one ‘expert’ was actually a set of PR-friendly answers from a company rather than a person, including listing articles about their own software. The expert from Microsoft rather predictably failed to notice the impact of open source on the search market, before going on to put a positive spin on the raft of acquisitions of search companies over the last few years (and it’s certainly not all good, as a recent writedown has proved). Apparently the acquisition of specialist search companies by corporate behemoths will drive innovation – that is, unless that specialist knowledge vanishes into the behemoth’s Big Data strategy, never to be seen again. Woe betide the past customers that have to get used to a brand new pricing, availability and support plan as well.

Luckily it wasn’t all bad – there were some sensible viewpoints on the need for better interaction with the user, the rise of semantic analysis and how the rise of open source is driving out inefficiency in the market – but the article is absolutely peppered with buzzwords (Big Data being the most prevalent, of course) and contains some odd cliches: “I think a generation of people believes the computer should respond like HAL 9000”… didn’t HAL 9000 kill most of the crew and attempt to lock the survivor outside the airlock?

I’m pretty sure this isn’t a feature we want to replicate in an Enterprise Search system.

Posted in News

May 15th, 2014

Cambridge Search Meetup – Cassandra & Solr

A sunny evening last night for the latest Cambridge Search Meetup, which featured a couple of talks from Datastax on the highly scalable NoSQL database Apache Cassandra and how it is integrated with Apache Lucene/Solr. Jeremy Hanna started us off with a brief history of the Facebook-incubated Cassandra, which is a fully distributed, highly reliable system used by many including Netflix and Spotify, with some customers running thousands of nodes in multiple data centres. Cassandra has its own SQL-like language, CQL3, and some basic collections such as Lists and Maps, but due to its fully distributed nature does lack some traditional features such as JOINs. Datastax themselves are now responsible for most of the ongoing work on Cassandra and offer the usual array of training, support, management services and tools. One common application mentioned was high speed and reliable recording of sensor data, increasingly important now with the rise of the Internet of Things.

After a short break for drinks and snacks (which this time were kindly sponsored by Datastax) Sergio Bossa told us how Solr is integrated with Cassandra, also running in a distributed fashion. Interestingly, this integration doesn’t use the same Zookeeper system as SolrCloud (the standard way to run clusters of Solr servers) but relies instead on Cassandra’s own internal scaling systems, passing data about using ‘gossip’ between nodes. Zookeeper is not always the easiest thing to get running so an alternative is very interesting! Data can be added to the system over HTTP or the aforementioned CQL3 and after being entered into Cassandra’s tables is subsequently indexed by Solr. Queries can then be made over HTTP as usual. Some work is still necessary to prevent duplication of effort (at present one needs to create data structures in Cassandra and subsequently in Solr).

It was pleasing to see that so much care has been taken with this integration process and also that Datastax offer their Datastax Enterprise Search stack not only free for non-production use, but free to startups. Thanks to Jeremy, Sergio and all who came along and we’ll be back with another Search Meetup soon.

As Hadoop gains, does Lucene benefit?

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in an HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There have also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!

Convergence and collisions in Enterprise Search

At the end of next month I’ll be at Enterprise Search Europe (I’m on the programme committee and help with the open source track) and the opening keynote this year is from Dale Roberts, author of the book Decision Sourcing. Dale will be talking about how Social, Big Data, Analytics and Enterprise Search are on a collision course and business leaders ignore these four themes at their peril.

So I wondered if we could see how in practical terms one might build systems based on these four themes. There are technical and logistical challenges of course (not least convincing someone to pay for the effort) but it’s worth exploring nonetheless.

Social in a business context can mean many things: social media is inherently noisy (and as far as I can see mostly cats) but when social tools are used within a business they can be a great way to encourage collaboration. We ourselves have added social features to search applications – user tagging of search results for example, to improve relevance for future searches and to help with de-duplication. Much has been made of the idea of finding not just relevant documents, but the subject matter experts that may have written them, or just other people in your organisation who are interested in the same subject. From a technical point of view none of this is particularly hard – you just have to add these social signals to your index and surface them in some intuitive way – but getting a high enough percentage of users to contribute to shared discussions and participate in tagging can be difficult.

Big Data is an overused term – but in a business context people usually apply it to very large collections of log files or other data showing how your customers are interacting with your business. A lot of search engine experts will tell you that Big Data isn’t always that ‘big’ – we’ve been dealing with collections of hundreds of millions or even billions of indexed items for many years now; the trick is scaling your solution appropriately (not just in technical terms, but in an economic way, as linearly as possible). If you’ve got a few million items, I’m sorry but you haven’t got Big Data, you’ve just got some data.

I’ve always been unsure of the benefits of search Analytics but I’m beginning to change my mind, having seen some very impressive demos recently. Search engines have always counted things; the clever bit is allowing for queries that can surface unusual or interesting information, and using modern visualisation techniques to show this. Knowing the most popular search term may not be as important as spotting an unexpected one.

So we’ve indexed our data including tags, personnel records, internal chatrooms; put them all onto an elastically scalable platform and built some intuitive and useful interfaces to search and analyze our data. I’m pretty sure you could do all this with the open source technologies we have today (including Scrapy, Apache Lucene/Solr, Elasticsearch, Apache Hadoop, Redis, Logstash, Kibana, jQuery, Dropwizard, Python and Java). This isn’t the whole story though: you’d need a cross-disciplinary team within your organisation with the ability to gather user requirements and drive adoption, a suitable budget for prototyping, development and ongoing support and refinements to the system and a vision encompassing the benefits that it would bring your business. Not an inconsiderable challenge!

What questions should we be able to ask the system? I’ll leave that as an exercise for the reader.

See you in April! If you’d like a 20% discount on registration use the code HULL20. We’ll also be running an evening Meetup on Tuesday 29th April open to both conference attendees and others.

Time for the crystal ball again…

It’s always fun to make predictions about the future, especially as one can be pretty sure to be proved wrong in interesting ways. At the start of 2014 we at Flax are looking forward to another year of building open source search and we already have some great client projects in progress that we’ll shortly be able to talk about, but what else might be happening this year? Here are some points to note:

  • The Elasticsearch project continues to add features at a prodigious rate during the arms race between it and Apache Solr – this battle can only be good news for end users in our view. We can expect a 1.0 release of Elasticsearch this year and several further major 4.x releases of Solr.
  • The Solr world has become slightly more complex as original author Yonik Seeley has left Lucidworks to start his own company, Heliosearch – with its own packaged distribution of Solr. How will Heliosearch contribute to the Solr ecosystem?
  • HP Autonomy is a sponsor of the Enterprise Search Europe conference this year, although there’s still some fallout from HP’s acquisition of Autonomy, and little news from the various official investigations into this process. Perhaps this year HP’s overall strategy will become a little clearer.
  • The Big Data bandwagon rolls on and more or less every search company now stresses its capabilities in this area for marketing purposes: but how big is Big? It’s not enough just to re-quote IDC’s latest study on how many exabytes everyone is producing these days; the value is in the detail, not the sheer volume: good (and deep) analytics is the key.
  • We think there might be some interesting things happening around open source search and bioinformatics soon – watch this space!

Posted in News

January 7th, 2014

Finding the elephant in the room: open source search & Hadoop grow closer together

I’ve been lucky enough to attend two talks on Hadoop in the last few weeks which has made me take a closer look at this technology. In case you didn’t know, Hadoop is an Apache top level open source project comprising a framework for distributed computing and storage, originally created by Doug Cutting (also the creator of Apache Lucene) while at Yahoo! in 2005. Distributed computing is carried out using MapReduce (roughly speaking, the ‘map’ bit involves splitting a processing task up into chunks and distributing these among various processing nodes, the ‘reduce’ bit brings all the results together again) and the storage uses the Hadoop Distributed File System (HDFS). There are other parts of Hadoop including a database (HBase), data warehouse with SQL-like language (Hive), scripting language (Pig) and more.
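The map and reduce steps described above can be illustrated with a toy word count. This is only a single-process sketch in plain Python, not Hadoop’s actual API: the function names (`split_into_chunks`, `map_chunk`, `reduce_pairs`, `word_count`) and the sample documents are invented for illustration, and on a real cluster each chunk would be processed on a separate node.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(documents, n_chunks):
    """Split the input into chunks, one per (pretend) processing node."""
    return [documents[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """The 'map' bit: emit a (word, 1) pair for every word in this chunk."""
    return [(word.lower(), 1) for doc in chunk for word in doc.split()]


def reduce_pairs(pairs):
    """The 'reduce' bit: bring the partial results together by summing per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(documents, n_chunks=2):
    """Run the toy pipeline: split, map each chunk, then reduce everything."""
    chunks = split_into_chunks(documents, n_chunks)
    # On a real cluster these map calls would run in parallel on different nodes.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_pairs(chain.from_iterable(mapped))


if __name__ == "__main__":
    # Hypothetical sample input, purely for illustration.
    docs = ["search meets Hadoop", "Hadoop stores data", "search finds data"]
    print(word_count(docs))
```

The result does not depend on how many chunks the input is split into, which is what allows the map work to be spread across as many nodes as are available.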

Those I’ve spoken to who have attempted to build applications on Hadoop have said that it’s very much a kit of parts rather than an integrated platform, so not that easy to get started with – which has led to the emergence of various vendors providing ‘curated’ distributions and support, much as Lucidworks does for Apache Lucene/Solr. Cloudera, Hortonworks, and MapR are just some of the best-known of these vendors. With everyone jumping on the BigData bandwagon these days some of these vendors have attracted significant interest and funding.

As you might expect full-text search is often required for these distributed systems and there have been various attempts to bring Hadoop and search closer together. Hortonworks support integration with Elasticsearch, although this currently appears to mean that you can use Hive or Pig to move data from Hadoop on or off a separate Elasticsearch cluster, rather than the search engine running on the cluster itself. Cloudera’s integration of Hadoop with Solr appears to be tighter, with Solr storing its indexes on HDFS directly (perhaps not surprising considering Lucene/Solr committer Mark Miller, who is responsible for most recent SolrCloud development, works for Cloudera). Cloudera even has its own data conditioning framework Flume (yes, it seems we need yet another data conditioning/pipelining solution!) and allows for distributed indexing. MapR have partnered with LucidWorks and integrated LucidWorks Search into their distribution. All these vendors are heavy contributors to Hadoop of course and most also contribute to Lucene/Solr or Elasticsearch.

Since Hadoop has been linked with search from the beginning one can hope that these integration efforts will continue – applications that require distributed search are becoming increasingly common and Hadoop, despite its nature as a kit of parts requiring assembly, is a good foundation to build on.

Elasticsearch meetup – Duedil, Hadoop and more

I visited the London Elasticsearch User Group’s meetup last night for the first time, in the rather splendid HQ of Skills Matter just down from Old Street – the venue had a great buzz. The first speaker was Chris Simpson from Duedil who provide UK company information gleaned from Companies House and other sources. He told us about using Elasticsearch to provide faceted search (including some great clickable bar graphs for numerical range facets) and how they bulk index around 9 million company records in about an hour, using Elasticsearch’s alias features to swap in new indexes once they’re ready – so there is no impact on search performance while indexing. He mentioned a common problem with search engines, which is there is no easy way to be sure how much hardware you’ll need until you ‘know your data and know your hosts’.
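To give a rough idea of what an alias swap looks like, here is a minimal sketch that builds the JSON body for Elasticsearch’s `_aliases` endpoint, which accepts a list of `remove` and `add` actions in one request. It is not Duedil’s actual code: the helper `build_alias_swap` and the alias and index names (`companies`, `companies_v1`, `companies_v2`) are hypothetical, and the sketch only constructs the request body rather than sending it to a cluster.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build an `_aliases` request body that moves `alias` from the old
    index to the freshly built one in a single call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: searches go to the alias 'companies', while the
    # bulk indexer has just finished filling 'companies_v2'.
    body = build_alias_swap("companies", "companies_v1", "companies_v2")
    print(json.dumps(body, indent=2))
```

Because search clients only ever query the alias, the new index can be built in the background and switched in once it is complete.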

Next up was Chris Harris from Hortonworks, who provide a packaged and supported Apache Hadoop distribution. He explained how Hadoop can be used for capturing huge numbers of transactions (these could be interactions with an e-commerce website for example) and for storing them in a distributed database on low-cost hardware. The Hive ‘SQL-like’ language can then be used to extract the data and send it directly to Elasticsearch, or indeed to run queries on Elasticsearch and send the results back to Hadoop as a table. Similar processes can be run with the Pig scripting language. There followed some interesting discussions about the future of Hadoop, where search engines such as Elasticsearch may run directly on Hadoop nodes, working with the data locally. It will be interesting to compare this with the approach taken by Cloudera who are talking on Hadoop & Solr this Thursday at our own Meetup in Cambridge.

Clinton Gormley from Elasticsearch finished up with a Q&A, during which he talked about the new Phrase Suggesters based on Lucene’s new Finite State Machines, and gave hints about when the long awaited 1.0 release of Elasticsearch will appear – apparently early 2014 is now likely.

Thanks to all the speakers and to Elasticsearch for the very welcome beer and pizza – this certainly won’t be our last visit to this user group on what is an increasingly adopted open source search engine.

Search events for Autumn 2013

As usual there are several interesting events on the horizon this autumn: first up is a London Elasticsearch User Group Meetup on Monday September 9th, which will probably be shortly followed by a Cambridge Search Meetup once I have confirmed a venue – we’re hoping to include a talk on Solr & Hadoop.

November features Lucene Revolution, which this year is in Dublin – there are two days of training on 4th & 5th November followed by the conference itself on the 6th and 7th. We’re hoping to talk at this event if accepted (if you like, you can help us out by voting for our proposed topic “Turning Search upside down: using Lucene for very fast stored queries” which is based on some of our work for clients in the media monitoring sector). No fewer than four of the Flax team will be attending so we hope to catch up with you over a pint of Guinness there!

At the other end of November on the 27th is Search Solutions in London, a day of talks on all aspects of search hosted by the British Computer Society Information Retrieval Specialist Group. I’ll be running a training session the day before on open source search.

Do let us know of any other events in the area of search – we’re as ever very happy to publicise them.

A belated report on Enterprise Search Europe 2013

Earlier this month I attended the third Enterprise Search Europe conference, this time not to speak but to run workshops, panels, tracks and social events. On Tuesday a colleague and I gave a workshop on Getting the Best from Open Source Search which I hope was useful to attendees: one thing I did take away is how the level of experience with open source and indeed search technology itself can vary widely: some attendees had already experimented widely with Apache Lucene/Solr and some simply wanted to expand their knowledge of the associated risks & benefits of this approach.

The first day of the conference started with Ed Dale of Ernst & Young talking about implementing enterprise search for a truly global organisation. E&Y’s search is over a surprisingly small number of documents (only 2 million or so) but they are lucky enough to have a relatively large and experienced team running their search as an ongoing operation – no ‘fire and forget’ here (an approach often taken and seldom successfully). We moved on to hear from Kristian Norling on the second year of Findwise’s Enterprise Search Survey (some interesting numbers with the full results available soon) and then a fascinating and amusing talk from Joe Lamantia on the Language of Discovery, backed up by a second talk from Tyler Tate – it seems Discovery might be a better term for what we call Search, at least from a usability perspective. The morning ended with Stephen Arnold’s provocative take on how the performance of search technology hasn’t improved measurably in many decades due to processing limitations and how the rise of Big Data is only going to compound the problem.

The afternoon began with a panel session on the future of open source search – my personal thanks to Daniel Lee of Artirix, Eric Pugh of Open Source Connections and René Kriegler for leading a lively discussion on the seemingly inexorable rise of open source search and what may happen next. There were some interesting points raised on how significant investment in open source search may change the picture. We continued in the open source theme with talks on open source solutions for the City of Antibes and Shopping24, before a drinks reception and then moving to the pub across the road for the combined London and Cambridge Search Meetup. Our theme was ‘The Nightmare before Search’ – some great (and unbloggable!) war stories on crazy search implementations were followed by networking late into the night.

The next day continued with a session on search implementation from speakers including Dan Foster of Legal & General, a track on Big Data during which we heard from Eric Pugh on building a very large scale system using open source software – sadly I had to drop out at this point for meetings and only returned for the closing plenary sessions. I particularly enjoyed Kara Pernice’s insights on how to build usable intranet search and Valentin Richter’s session on migrating to a new search technology (a topic on many minds especially for those using FAST ESP which goes out of mainstream support in a couple of months). Lynda Moulton did her best to sum up what we had learnt over the last few days – a very hard job when the event covered so many aspects of search & discovery.

Many thanks to Information Today and chair Martin White as ever for organising the event – although it was an intense few days it was great to catch up with everyone and to talk search. We’re looking forward to next year – did I hear a rumour that the Europe in the title might be more emphasized next time? We shall see!

Posted in events

May 28th, 2013

New Year predictions: further search storms ahead!

2012 has been a fascinating and stormy year for those of us in the search business. We’ve seen a raft of further acquisitions of commercial closed source search companies by bigger players, some convinced that what used to be called Enterprise Search is now a solution to Big Data (like Stephen Arnold we wonder what will succeed Big Data as the next marketing term – I love his phrase “In a quest for revenue, the vendors will wrap basic ideas in a cloud of unknowing”). One acquisition hasn’t gone so smoothly: Autonomy, bought by HP for a price that no-one in the search business thought was remotely sensible, has been accused of being oversold vapourware: this is a story that will continue to develop in 2013. If you want a great overview of the current market read Martin White’s latest research note.

Here in the slightly calmer waters of open source search, we’ve seen a huge rise in enquiries from often blue-chip companies, no longer needing persuasion that open source is a serious contender for even the largest search and content projects. Often these companies have considered large commercial solutions but are put off by both the price and high-pressure marketing tactics – in a world of reduced budgets you simply can’t sell magic beans for a pile of gold. We’ve also seen increased interest in related technologies such as machine learning and automatic categorisation – search really isn’t just about search any more.

At Flax we’re busier than we have ever been and we’re expecting the trend to continue. We’re looking forward to running more Cambridge Search Meetups, visiting and helping organise conferences such as Enterprise Search Europe and Lucene Revolution, building our network of carefully chosen partners and of course working on exciting and cutting-edge development projects.

As the storms in our sector continue to rage overhead we’ll simply be getting on with what we do best, building effective search.

Posted in Business, News

January 3rd, 2013
