>crawl.py http://www.hao123.com/index.htm
The output is as follows:
parsedurl = ParseResult(scheme='http', netloc='www.hao123.com', path='/index.htm', params='', query='', fragment='')
path = www.hao123.com/index.htm
ext = ('www.hao123.com/index', '.htm')
path = www.hao123.com/index.htm
ldir = www.hao123.com
ldir = www.hao123.com
path = www.hao123.com/index.htm
self.url = http://www.hao123.com/index.htm
self.file = www.hao123.com/index.htm
retval = ('www.hao123.com/index.htm', <httplib.HTTPMessage instance at 0x010F9968>)
( 1 )
URL: http://www.hao123.com/index.htm
FILE: www.hao123.com/index.htm
* http://www.hao123.com ... new, added to Q
* http://www.hao123.com/redian/tongzhi.htm ... new, added to Q
* http://utility.hao123.com/quality_form.php ... discarded, not in domain
* javascript:void(0) ... discarded, javascript
* http://www.hao123.com/redian/scookie.htm ... new, added to Q
* javascript:void(0) ... discarded, javascript
* javascript:void(0) ... discarded, javascript
* javascript:void(0) ... discarded, javascript
* http://www.hao123.com ... discarded, already in Q
* http://wenku.baidu.com ... discarded, not in domain
* http://baike.baidu.com ... discarded, not in domain
* http://jingyan.baidu.com ... discarded, not in domain
* http://hi.baidu.com ... discarded, not in domain
* http://top.baidu.com ... discarded, not in domain
* http://dict.baidu.com ... discarded, not in domain
* http://s.baidu.com ... discarded, not in domain
* http://www.baidu.com ... discarded, not in domain
* http://www.hao123.com/daquan/shfwsite.htm ... new, added to Q
* http://www.hao123.com/netbuy.htm ... new, added to Q
* http://www.hao123.com/caipiao.htm ... new, added to Q
* http://www.hao123.com/haoserver/index.htm ... new, added to Q
* http://www.hao123.com/tianqi.htm ... new, added to Q
* http://www.hao123.com/stock.htm ... new, added to Q
* http://www.hao123.com/stock3.htm ... new, added to Q
* http://www.hao123.com/bankjt.htm ... new, added to Q
* http://www.hao123.com/lvyou.htm ... new, added to Q
..........
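
The log above suggests two steps: the URL is mapped to a local file path (netloc plus path), and each link found on the page is either queued or discarded. The following is a minimal Python 3 sketch of that behavior, not the actual crawl.py source. The function names `local_path` and `classify`, the `default` file name, and the substring-based domain check are assumptions made for illustration; only the four status strings are taken from the log.

```python
import os
from urllib.parse import urlparse


def local_path(url, default="index.htm"):
    """Map a URL to a relative local file path (netloc + path).

    Hypothetical helper: if the URL path has no file extension,
    it is treated as a directory and `default` is appended.
    """
    parsed = urlparse(url)
    path = parsed.path
    if os.path.splitext(path)[1] == "":
        path = path.rstrip("/") + "/" + default
    return parsed.netloc + path


def classify(link, domain, queue):
    """Return a status string for one link; enqueue the link if it is new.

    Hypothetical helper covering only the statuses seen in the log.
    """
    if link.lower().startswith("javascript:"):
        return "discarded, javascript"
    # Assumption: a link is "in domain" when its host contains `domain`.
    if domain not in urlparse(link).netloc:
        return "discarded, not in domain"
    if link in queue:
        return "discarded, already in Q"
    queue.append(link)
    return "new, added to Q"


if __name__ == "__main__":
    start = "http://www.hao123.com/index.htm"
    print("FILE:", local_path(start))
    q = []
    for link in ["http://www.hao123.com",
                 "http://utility.hao123.com/quality_form.php",
                 "javascript:void(0)",
                 "http://www.hao123.com"]:
        print("*", link, "...", classify(link, urlparse(start).netloc, q))
```

Applying `os.path.splitext` to the URL path only, rather than to netloc plus path, is a deliberate choice in this sketch: for a bare host such as `http://www.hao123.com`, splitting the combined string would treat `.com` as a file extension.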