Posts Tagged ‘flax’

flax.crawler arrives

We’ve recently uploaded a new crawler framework to the Flax code repository. It is designed for use from Python to build a web crawler for your project. It’s multithreaded and simple to use; here’s a minimal example:

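    import crawler

    # MyContentDumperImplementation is your own class; StdURL is assumed to be
    # provided by the crawler framework and must be imported before use.
    crawler.dump = MyContentDumperImplementation()
    crawler.pool.add_url(StdURL("http://test/"))
    crawler.pool.add_url(StdURL("http://anothertest/"))
    crawler.start()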

Note that you can provide your own implementation of various parts of the crawler – and you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.

We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.
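As a rough illustration of the idea (not the reference implementation itself), a SQLite-backed content dumper might look like the sketch below. The class name, the `dump(url, content)` method and the table layout are all assumptions made for this example; check the crawler source for the interface it actually expects.

```python
import sqlite3
import threading


class SQLiteContentDumper:
    """Hypothetical content dumper: stores each URL and its content in SQLite."""

    def __init__(self, path=":memory:"):
        # The crawler is multithreaded, so allow use from several threads
        # and guard every database access with a lock.
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.lock = threading.Lock()
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content BLOB)"
        )

    def dump(self, url, content):
        # INSERT OR REPLACE keeps a single row per URL if a page is re-crawled.
        with self.lock, self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
                (url, content),
            )

    def get(self, url):
        # Return the stored content for a URL, or None if it was never dumped.
        with self.lock:
            row = self.conn.execute(
                "SELECT content FROM pages WHERE url = ?", (url,)
            ).fetchone()
        return row[0] if row else None
```

Keying the table on the URL means a re-crawl overwrites the earlier copy rather than accumulating duplicates; a real store might keep a fetch timestamp as well.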

Posted in Technical

August 2nd, 2010

flax.core 0.1 available

Charlie wrote previously that we try to work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    from datetime import datetime

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()                      # write pending changes to disk

Fields can be of type string, int, float or datetime. These are handled automatically, and types are not tied to field names (so the same field could hold values of different types in different documents, not that this is a good idea).

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths, SQL column names, etc.)
  • flax field names
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.
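To make the file layout concrete, here is a minimal sketch of how such an actions file could be parsed. It is not the flax.core parser: the function name, the regular expression and the assumption that identifiers are unindented while action lines are indented are all ours, inferred from the example above.

```python
import re

# "fieldname: action" or "fieldname: action(argument)"
ACTION_RE = re.compile(
    r"^(?P<field>\w+)\s*:\s*(?P<action>\w+)(?:\((?P<arg>[^)]*)\))?$"
)


def parse_actions(text):
    """Map each external identifier to a list of (field, action, argument)."""
    actions = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate groups
        if not line[0].isspace():
            # Unindented line: a new external identifier (e.g. an XPath).
            current = line.strip()
            actions[current] = []
            continue
        if current is None:
            raise ValueError("action line before any identifier: %r" % line)
        match = ACTION_RE.match(line.strip())
        if match is None:
            raise ValueError("cannot parse action line: %r" % line)
        actions[current].append(
            (match.group("field"), match.group("action"), match.group("arg"))
        )
    return actions


EXAMPLE = """\
.//metadata[@name='Author']/@value
    author: filter(facet)
    author2: index(default)

.//metadata[@name='Year']/@value
    published: numeric
"""

parsed = parse_actions(EXAMPLE)
```

Here `parsed` maps the Author XPath to two entries (`author` with `filter`/`facet`, `author2` with `index`/`default`) and the Year XPath to a single `published`/`numeric` entry with no argument.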

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.

Posted in Technical

June 24th, 2010

Packaged solutions and customisability, the Python way

With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.

Our Flax system is based on the Xapian core, which has bindings to various languages including Perl, Python, PHP, Java, Ruby, C# and even Tcl. This makes integration relatively easy with systems where a particular language is preferred. However, for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.

The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.

We’re not alone in our preference for Python of course!

Posted in Technical

June 14th, 2010

The Times they are a-changing….

News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.

The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetizing newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage, you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.

Posted in Business, News

March 26th, 2010

Some new open source file filters & previewers

We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and PowerPoint, the OpenOffice equivalent formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.

We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.

Feedback would of course be very welcome.

Posted in Technical

March 12th, 2010

Flax Newsletters

I’ve created a page with links to our Flax Newsletters – let us know if you would like to be added to the mailing list (or indeed, if you’d like to be removed from it).

Posted in News

December 2nd, 2009

Finding French TV with Flax

We’ve recently been working with mySkreen, who, like Hulu in the U.S., provide a service for finding and viewing television programmes via your web browser. mySkreen is the brainchild of Frédéric Sitterlé, previously Head of New Media at the Le Figaro media group.

mySkreen works with French-language content, and is currently indexing over 1.6 million programmes (and counting). Using Flax, you can search using programme title, actors, genres or time periods. We also added some innovative query parsing to translate fuzzy queries such as ‘tomorrow evening’ into more exact time periods, and some clever ranking so that ‘more easily available’ programmes appear higher in the search results. We also added faceted search and automatic spelling correction.
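To give a flavour of the fuzzy-query idea, the sketch below turns a phrase like ‘tomorrow evening’ into an explicit time window. It is an illustration only, not the mySkreen code: the function name, the word lists and the hours chosen for each part of the day are all assumptions.

```python
from datetime import datetime, time, timedelta

# Assumed day offsets and time-of-day windows; the values are illustrative.
DAY_OFFSETS = {"today": 0, "tonight": 0, "tomorrow": 1}
DAY_PARTS = {
    "morning": (time(6, 0), time(12, 0)),
    "afternoon": (time(12, 0), time(18, 0)),
    "evening": (time(18, 0), time(23, 0)),
    "tonight": (time(18, 0), time(23, 0)),
}


def resolve_period(phrase, now):
    """Return (start, end) datetimes for a phrase such as 'tomorrow evening'.

    Returns None if the phrase contains no known day word.
    """
    words = phrase.lower().split()
    offsets = [DAY_OFFSETS[w] for w in words if w in DAY_OFFSETS]
    if not offsets:
        return None
    day = (now + timedelta(days=offsets[0])).date()
    parts = [DAY_PARTS[w] for w in words if w in DAY_PARTS]
    # Without a time-of-day word, cover the whole day.
    start_t, end_t = parts[0] if parts else (time(0, 0), time(23, 59, 59))
    return datetime.combine(day, start_t), datetime.combine(day, end_t)
```

The resulting pair of datetimes can then be passed to the search engine as an ordinary range filter on the broadcast time, so the fuzzy wording never reaches the index itself.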

This was a fast-moving project with a very quick turnaround: we first visited mySkreen in Paris in August and delivered customised code to them less than four weeks later; the flexibility of Flax and the open source model helped to make this possible.

Posted in News

November 26th, 2009

Replacing relational databases with search engines for simple lookups

One of the things we often notice about existing systems based on relational databases (RDB) is that as they scale to millions of items, simple lookup tasks become slow and inefficient. These tasks don’t usually require complicated database operations, so in most cases it is possible to relocate the data from the RDB into a search engine like Flax.

Consider a system where a search engine has already been implemented to search textual product information, but numerical data on each product, such as price, is still being stored in an RDB. Users will often need filters on search results such as ‘show me items under £10’, so an RDB operation similar to ‘SELECT productID FROM products WHERE price < 10’ will be needed in addition to the search engine query. Modern search engines like Flax implement range search functions, so that numerical information can be added to documents, and it is thus possible to carry out this operation in the search engine as part of the full-text search for the product information.
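The toy example below shows the principle with a tiny in-memory index: each document carries a numeric price alongside its terms, so the text match and the ‘under £10’ filter happen in one query. It is a stand-in written for this post, not the Flax or Xapian API, and the class and method names are hypothetical.

```python
class ToyIndex:
    """In-memory stand-in for a search engine; not the Flax or Xapian API."""

    def __init__(self):
        self.postings = {}  # term -> set of document ids
        self.prices = {}    # document id -> numeric value stored with the doc

    def add(self, doc_id, text, price):
        # Index every word of the text and store the price with the document.
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)
        self.prices[doc_id] = price

    def search(self, query, max_price=None):
        """Return sorted ids of documents containing every query term,
        optionally restricted to those priced below max_price."""
        terms = query.lower().split()
        if not terms:
            return []
        matches = set.intersection(
            *(self.postings.get(t, set()) for t in terms)
        )
        if max_price is not None:
            matches = {d for d in matches if self.prices[d] < max_price}
        return sorted(matches)


idx = ToyIndex()
idx.add(1, "red kettle", 8.50)
idx.add(2, "steel kettle", 24.00)
idx.add(3, "red mug", 4.00)
cheap_kettles = idx.search("kettle", max_price=10)
```

A real engine would answer the range part from a sorted value slot rather than by scanning the matches, but the effect for the application is the same: no second round trip to the database.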

We’ve noticed with several clients that it is now possible to move all their data from the original RDB into the search engine. This can obviously lead to cost savings, as only one system must be hosted, maintained and backed up, and scaling out can be far simpler.

Another way to look at this is to consider a search engine as an example of a document-oriented database.

Posted in Technical

August 27th, 2009

Whitepaper on enterprise search

Our technical partners Cognidox have released a whitepaper detailing their view of the enterprise search market, titled “Why you can’t just ‘Google’ for Enterprise Knowledge” – it’s well worth a read. You can download the PDF from their archive.

Posted in News

July 13th, 2009

Enterprise search – for free

We recently helped a small marine consultancy, running a Windows network, implement a completely free enterprise search solution. Even SMEs are now finding it hard to keep on top of the information they produce, and there are few low-cost options for searching their documents. Read the case study here (PDF).

Posted in Business

July 10th, 2009
