Posts Tagged ‘flax’

Meetups, genomes and hack days: Grant Ingersoll visits the UK

Lucene/Solr committer, Mahout co-creator, LucidWorks co-founder and all-round search expert Grant Ingersoll visited us last week on his way to the SIGIR conference in Dublin. We visited the European Bioinformatics Institute on the Wellcome Trust Genome Campus to hear about some fascinating projects using Lucene/Solr to index genomes, phenomes and proteins, and for Grant to give a talk on recent developments in both Lucene/Solr and Mahout. It was gratifying that over 50 people turned up to listen, and that at least 30 of them indicated they were already using the technology.

After a brief rest it was time to travel to London so Grant could talk at the Enterprise Search London Meetup, both on recent developments in Lucene/Solr and on what he dubbed ‘Search engine (ab)use’ – some crazy use cases of Lucene/Solr, including its use as a very fast key/value store. He shared some great statistics, including how Twitter makes new tweets searchable in around 50 microseconds using only 8-10 indexing servers.

Next it was back to Cambridge for our own Lucene/Solr hack day in a great new co-working space. Attendees ranged from those who had never used Lucene/Solr to those with significant search expertise, and some had come from as far away as Germany. After a brief introduction we split into several groups, each mentored by a member of the Flax team. Two groups (one composed entirely of those who had never used Lucene) worked on a dataset of tweets from UK members of parliament, and a healthy sense of competition developed between them – you can see some of the code they developed in our GitHub account, including an entity extractor webservice. Another group, led by Grant, created a SolrCloud cluster with around 1-2 million documents split into 2 shards – running on ten laptops over a wireless connection! Impressively, this was set up in less than ten minutes. Others worked on their own applications, including an index of proteins, and there was even some work on the Lucene/Solr code itself.

We’re hoping to put the results of some of these projects live very soon, so you can see just what can be built in a single day using this powerful open source software. Thanks to all who came, our hosts at Cambridge Business Lounge and of course Grant for his considerable energy and invaluable expertise. If nothing else, we’ve introduced a lot more people to open source search and sparked some ideas, and we ended off the week with beer in a sunny pub garden which is always nice!

Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server and we use this to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
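In outline, this ‘early binding’ approach can be sketched in a few lines of Python (a deliberately simplified illustration with made-up data, not the actual Flax code): each document is indexed with the principals allowed to read it, and at query time results are filtered against the principals of the logged-in user.

```python
# Simplified sketch of index-time security trimming -- illustrative only,
# not the actual Flax implementation.

documents = [
    {"id": 1, "text": "board minutes", "allow": {"group:directors"}},
    {"id": 2, "text": "staff handbook", "allow": {"group:staff", "group:directors"}},
]

def search(query, user_principals):
    """Return only the matching documents the user is permitted to see."""
    return [
        d for d in documents
        if query in d["text"] and d["allow"] & user_principals
    ]

# A member of 'staff' sees only the handbook:
print([d["id"] for d in search("", {"group:staff"})])  # prints [2]
```

In the real system the allowed principals come from the Unix permissions and ACLs captured at crawl time, and the user’s own principals from the LDAP lookup at login.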

You can read more about the project in a case study (PDF) and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.

Building a new press cuttings service for the Financial Times

Those of you who read my slides from Search Solutions 2010 will have spotted a case study on our work for the Financial Times, one of the world’s leading business news organisations.

When the Financial Times decided to bring their digital press cuttings in-house in summer 2010, they asked us to build a powerful ’search server’ that they could easily integrate into their existing product offerings.

We built an indexer for the XML source data and a RESTful web service API, offering search features including Boolean operators, phrase searches, area specifiers (search the whole article, body, headline, byline or any combination), date range restrictions, similarity search (“articles like this one”) and faceted search. Also available are spelling correction and synonyms, plus detailed logging of indexing and of all searches.
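As a rough illustration of how a client might drive such an API over HTTP (the endpoint and parameter names below are invented for the example; the FT service’s actual interface is not documented here):

```python
from urllib.parse import urlencode

# Hypothetical parameter names, for illustration only.
params = {
    "q": '"interest rates" AND headline:inflation',  # boolean, phrase and area search
    "from": "2010-01-01",                            # date range restriction
    "to": "2010-06-30",
    "facet": "region",                               # request facet counts
}
url = "http://search.example.com/articles?" + urlencode(params)
print(url)
```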

This might sound like a complex task, but using open source technology we created the system in less than a fortnight. Initially designed as a small-scale prototype, it scaled easily to indexing hundreds of thousands of pages. You can use the service at http://presscuttings.ft.com.


Posted in News, Technical

October 25th, 2010


Open source search engines and programming languages

So you’re writing a search-related application in your favourite language, and you’ve decided to choose an open source search engine to power it. So far, so good – but how are the two going to communicate?

Let’s look at two engines, Xapian and Lucene, and compare how this might be done. Lucene is written in Java and Xapian in C++, so if you’re using those languages respectively, everything should be relatively simple: just download the source code and get on with it. However, if this isn’t the case, you’re going to have to work out how to interface to the engine.

The Lucene project has been rewritten in several other languages: for C/C++ there’s Lucy (which includes Perl and Ruby bindings), for Python there’s PyLucene, and there’s even a .NET version called, not surprisingly, Lucene.NET. Some of these ‘ports’ of Lucene are ‘looser’ than others (i.e. they may not share the same API or feature set), and they may not be updated as often as Lucene itself. There are also versions in Perl, Ruby, Delphi and even Lisp (scary!) – there’s a full list available, though not all are currently active projects.

Xapian takes a different approach, with only one core project, but a sheaf of bindings to other languages. Currently these bindings cover C#, Java, Perl, PHP, Python, Ruby and Tcl – but interestingly these are auto-generated using the Simplified Wrapper and Interface Generator or SWIG. This means that every time Xapian’s API changes, the bindings can easily be updated to reflect this (it’s actually not quite that simple, but SWIG copes with the vast majority of code that would otherwise have to be manually edited). SWIG actually supports other languages as well (according to the SWIG website, “Common Lisp (CLISP, Allegro CL, CFFI, UFFI), Lua, Modula-3, OCAML, Octave and R. Also several interpreted and compiled Scheme implementations (Guile, MzScheme, Chicken)”) so in theory bindings to these could also be built relatively easily.

There’s also another way to communicate with both engines: using a search server. Solr is the search server for Lucene, whereas for Xapian there is the Flax Search Service. In this case, any language that supports web services (you’d be hard pressed to find a modern language that doesn’t) can communicate with the engine, simply passing data over HTTP.
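For instance, a Python client needs only the standard library to query a Solr-style server. The sketch below builds a URL for Solr’s standard select handler; the actual fetch is commented out since it needs a running server:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def select_url(base, q, rows=10):
    """Build a query URL for Solr's standard select handler."""
    return base + "/select?" + urlencode({"q": q, "rows": rows, "wt": "json"})

url = select_url("http://localhost:8983/solr", "title:xapian")
# results = json.load(urlopen(url))   # needs a running Solr instance
print(url)
```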


Posted in Technical

September 3rd, 2010


flax.crawler arrives

We’ve recently uploaded a new crawler framework to the Flax code repository. It’s designed to be used from Python to build a web crawler for your project. It’s multithreaded and simple to use; here’s a minimal example:

import crawler

# StdURL and the URL pool come from the crawler framework; the content
# dumper (which stores whatever is fetched) is something you implement.
crawler.dump = MyContentDumperImplementation()
crawler.pool.add_url(StdURL("http://test/"))
crawler.pool.add_url(StdURL("http://anothertest/"))
crawler.start()

Note that you can provide your own implementation of various parts of the crawler – and you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.
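A content dumper can be as simple as the class below (the dump() name and signature are guesses for illustration – check the flax.crawler source for the interface the framework actually calls):

```python
class InMemoryContentDumper:
    """Stores crawled content in a dict keyed by URL.

    The dump() signature here is illustrative; consult the flax.crawler
    source for the interface it really expects.
    """
    def __init__(self):
        self.pages = {}

    def dump(self, url, content):
        self.pages[url] = content

dumper = InMemoryContentDumper()
dumper.dump("http://test/", b"<html>hello</html>")
print(len(dumper.pages))  # prints 1
```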

We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.


Posted in Technical

August 2nd, 2010


flax.core 0.1 available

Charlie wrote previously that we try to work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    from datetime import datetime

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).
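To make term-based filtering and sorting work, non-string values have to be encoded so that lexicographic order matches natural order. A common trick (not necessarily flax.core’s internal encoding) is fixed-width formatting:

```python
from datetime import datetime

def sortable(value):
    """Encode a value so string comparison matches natural ordering.

    A common indexing trick, not necessarily what flax.core does internally.
    """
    if isinstance(value, datetime):
        return value.strftime("%Y%m%d%H%M%S")
    if isinstance(value, int):
        return "%020d" % value        # zero-pad (non-negative ints only)
    return str(value)

print(sortable(datetime(2010, 2, 3, 12, 0)))  # prints 20100203120000
```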

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths, SQL column names, etc.)
  • flax fieldnames
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.
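The format is simple enough that a parser fits in a few lines of Python (a sketch of the format shown above, not the actual flax.core parser):

```python
def parse_actions(text):
    """Parse an actions file into {identifier: [(fieldname, action), ...]}.

    A sketch only -- not the flax.core implementation.
    """
    actions, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():          # unindented: a new external identifier
            current = line.strip()
            actions[current] = []
        else:                              # indented: "fieldname: action"
            field, action = line.strip().split(":", 1)
            actions[current].append((field.strip(), action.strip()))
    return actions

sample = """.//metadata[@name='Year']/@value
    published: numeric
"""
print(parse_actions(sample))
```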

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines; the other is a minimal web search application of a similar length. Currently there is no documentation other than these examples and the docstrings in flax.core – if anyone needs more, I’ll put some together.


Posted in Technical

June 24th, 2010


Packaged solutions and customisability, the Python way

With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.

Our Flax system is based on the Xapian core, which has a set of bindings to various different languages including Perl, Python, PHP, Java, Ruby, C# and even Tcl, which makes integration with systems where a particular language is preferred relatively easy. However, for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.

The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.
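To give a flavour of that brevity, here is a toy indexer built on a standard-library XML parser (a self-contained sketch, not one of the indexers we ship):

```python
import xml.etree.ElementTree as ET

def index_articles(xml_text):
    """Turn <article> elements into indexable field dicts -- a toy example."""
    return [
        {
            "headline": article.findtext("headline", ""),
            "body": article.findtext("body", ""),
        }
        for article in ET.fromstring(xml_text).iter("article")
    ]

sample = "<feed><article><headline>Hello</headline><body>World</body></article></feed>"
print(index_articles(sample))  # prints [{'headline': 'Hello', 'body': 'World'}]
```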

We’re not alone in our preference for Python of course!


Posted in Technical

June 14th, 2010


The Times they are a-changing….

News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.

The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetising newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage, you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.


Posted in Business, News

March 26th, 2010


Some new open source file filters & previewers

We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and PowerPoint, the equivalent OpenOffice formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (and of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.

We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.

Feedback would of course be very welcome.


Posted in Technical

March 12th, 2010


Flax Newsletters

I’ve created a page with links to our Flax Newsletters – let us know if you would like to be added to the mailing list (or indeed, if you’d like to be removed from it).


Posted in News

December 2nd, 2009
