
flax.crawler arrives

We’ve recently uploaded a new crawler framework to the Flax code repository. It is designed to be used from Python to build a web crawler for your project. It’s multithreaded and simple to use. Here’s a minimal example:

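import crawler

# MyContentDumperImplementation is your own class, and StdURL must also be
# in scope; neither is defined or imported in this snippet.

# Tell the crawler where to store whatever it downloads.
crawler.dump = MyContentDumperImplementation()

# Seed the URL pool with the pages to start crawling from.
crawler.pool.add_url(StdURL("http://test/"))
crawler.pool.add_url(StdURL("http://anothertest/"))

# Start crawling.
crawler.start()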

You can provide your own implementation of various parts of the crawler; at a minimum, you must provide a ‘content dumper’ to store whatever the crawler finds and downloads.
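To give a feel for what a content dumper might look like, here is a minimal sketch that keeps downloaded pages in memory. The class name InMemoryContentDumper, the method name dump and its (url, content) arguments are all assumptions made for illustration; check the code in the repository for the interface the crawler really expects. Since the crawler is multithreaded, the sketch guards its storage with a lock.

```python
import threading


class InMemoryContentDumper:
    """Hypothetical content dumper that keeps downloaded pages in a dict.

    The method name `dump` and its (url, content) arguments are assumptions
    for illustration, not the documented flax.crawler interface.
    """

    def __init__(self):
        # The crawler is multithreaded, so protect shared state with a lock.
        self._lock = threading.Lock()
        self.pages = {}

    def dump(self, url, content):
        # Serialise writes so concurrent worker threads cannot interleave.
        with self._lock:
            self.pages[url] = content


if __name__ == "__main__":
    dumper = InMemoryContentDumper()
    # Simulate several crawler threads handing over content at once.
    workers = [
        threading.Thread(
            target=dumper.dump,
            args=("http://test/%d" % i, b"<html></html>"),
        )
        for i in range(4)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(len(dumper.pages))
```

Something along these lines would stand in for MyContentDumperImplementation in the example above, provided the method names are adapted to whatever the crawler actually calls.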

We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.
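For readers curious how storing crawl results in SQLite3 can work, here is a rough sketch using Python’s standard sqlite3 module. It is not the reference implementation itself: the class name SQLiteContentDumper, the dump and get methods, and the pages table layout are all hypothetical, chosen only to show the idea of one row per URL holding the downloaded content.

```python
import sqlite3
import threading


class SQLiteContentDumper:
    """Hypothetical dumper storing one row per URL in a SQLite3 database.

    The table layout and method names are assumptions for illustration;
    they are not taken from the reference implementation.
    """

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        # check_same_thread=False lets worker threads share the connection;
        # the lock above serialises every access to it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, content BLOB)"
        )
        self._conn.commit()

    def dump(self, url, content):
        # Re-crawling a URL overwrites the previously stored content.
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
                (url, content),
            )
            self._conn.commit()

    def get(self, url):
        # Return the stored content for a URL, or None if it was never seen.
        with self._lock:
            row = self._conn.execute(
                "SELECT content FROM pages WHERE url = ?", (url,)
            ).fetchone()
        return row[0] if row else None


if __name__ == "__main__":
    store = SQLiteContentDumper()  # in-memory database for the demo
    store.dump("http://test/", b"<html>hello</html>")
    print(store.get("http://test/"))
```

Using the URL as the primary key keeps the table free of duplicates when the same page is fetched more than once; a real crawler would probably also record things like fetch time and HTTP status.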


Posted in Technical

August 2nd, 2010
