flax.crawler arrives
We’ve recently uploaded a new crawler framework to the Flax code repository. This is designed for use from Python to build a web crawler for your project. It’s multithreaded and simple to use, here’s a minimal example:
import crawler
crawler.dump = MyContentDumperImplementation()
crawler.pool.add_url(StdURL("http://test/"))
crawler.pool.add_url(StdURL("http://anothertest/"))
crawler.start()
Note that you can provide your own implementation of various parts of the crawler – and you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.
We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.
Tags: crawling, flax, open source, python
This entry was posted on Monday, August 2nd, 2010 at 3:22 pm and is filed under Technical. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.
Leave a Reply