We’ve recently uploaded a new crawler framework to the Flax code repository. It is designed to be used from Python to build a web crawler for your project. It’s multithreaded and simple to use. Here’s a minimal example:
crawler.dump = MyContentDumperImplementation()
Note that you can provide your own implementation of various parts of the crawler, but you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.
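As a rough sketch of what a content dumper might look like, the class below simply collects pages in memory. The callable interface (a URL and the downloaded content passed to `__call__`) and the name `InMemoryDumper` are assumptions made for illustration, not the framework's documented API, so check the code in the repository for the real signature. Because the crawler is multithreaded, the sketch guards its shared dictionary with a lock.

```python
import threading


class InMemoryDumper:
    """Hypothetical content dumper that keeps downloaded pages in a dict."""

    def __init__(self):
        self._lock = threading.Lock()
        self.pages = {}

    def __call__(self, url, content):
        # Assumed signature: the crawler hands over a URL and its content.
        # The lock keeps concurrent worker threads from interleaving writes.
        with self._lock:
            self.pages[url] = content


# Standalone usage, calling the dumper directly as a crawler thread might.
dumper = InMemoryDumper()
dumper("http://example.com/", b"<html>hello</html>")
```

An in-memory store like this is only really useful for testing; anything you want to keep between runs needs to be written somewhere persistent.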
We’ve also included a reference implementation: a working crawler that stores URLs and downloaded content in a SQLite3 database.
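To give a feel for the idea of storing crawled pages in SQLite3, here is a small standalone sketch using Python's standard `sqlite3` module. It is not the reference implementation: the class name `SQLiteDumper`, the `pages` table layout and the `(url, content)` call signature are all assumptions chosen for illustration.

```python
import sqlite3
import threading


class SQLiteDumper:
    """Hypothetical dumper that stores URLs and content in SQLite."""

    def __init__(self, path=":memory:"):
        # check_same_thread=False lets worker threads share one connection;
        # the lock below serialises every use of it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock, self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS pages ("
                "url TEXT PRIMARY KEY, content BLOB)"
            )

    def __call__(self, url, content):
        # Using the connection as a context manager commits on success.
        with self._lock, self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
                (url, content),
            )

    def get(self, url):
        # Return the stored content for a URL, or None if it was never seen.
        with self._lock:
            row = self._conn.execute(
                "SELECT content FROM pages WHERE url = ?", (url,)
            ).fetchone()
        return row[0] if row else None


store = SQLiteDumper()
store("http://example.com/", b"<html>hello</html>")
store("http://example.com/", b"<html>updated</html>")  # replaces the first row
```

Keying the table on the URL means a re-crawled page overwrites its earlier copy rather than piling up duplicates; a design that keeps history would need a different schema.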