Summary
Whalebot is an open-source web crawler. It is intended to be simple, fast, and memory-efficient. It was created as a targeted spider, but you can also use it as a general-purpose crawler.
Current release: 0.02
Current state: bold items are done, normal items are TODO.
If something is broken or you have an idea, please visit http://groups.google.com/group/whalebot
Usages
- It was used to collect papers on a target topic from http://citeseerx.ist.psu.edu for my master's degree work
- Candidates for the logo were collected using Whalebot
- Eating my own dog food (collecting links for the URL parsing benchmark)
Features
- Simple configuration from the command line and text files
- Start/stop/resume of fetching sessions