我正在使用scrapy来获取数据,并且我想使用flask网络框架在网页中显示结果。但是我不知道如何在烧瓶应用程序中调用蜘蛛。我试图用它CrawlerProcess
来称呼我的蜘蛛,但出现了这样的错误:
ValueError
ValueError: signal only works in main thread
Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
_SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread
我这样的草率代码:
class EPGD(Item):
genID = Field()
genID_url = Field()
taxID = Field()
taxID_url = Field()
familyID = Field()
familyID_url = Field()
chromosome = Field()
symbol = Field()
description = Field()
class EPGD_spider(Spider):
name = "EPGD"
allowed_domains = ["epgd.biosino.org"]
term = "man"
start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
db = DB_Con()
collection = db.getcollection(name, term)
def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
url_list = []
base_url = "http://epgd.biosino.org/EPGD"
for site in sites:
item = EPGD()
item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
self.collection.update({"genID":item['genID']}, dict(item), upsert=True)
yield item
sel_tmp = Selector(response)
link = sel_tmp.xpath('//span[@id="quickPage"]')
for site in link:
url_list.append(site.xpath('a/@href').extract())
for i in range(len(url_list[0])):
if cmp(url_list[0][i], "#") == 0:
if i+1 < len(url_list[0]):
print url_list[0][i+1]
actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
yield Request(actual_url, callback=self.parse)
break
else:
print "The index is out of range!"
我的flask代码如下:
@app.route('/', methods=['GET', 'POST'])
def index():
process = CrawlerProcess()
process.crawl(EPGD_spider)
return redirect(url_for('details'))
@app.route('/details', methods = ['GET'])
def epgd():
if request.method == 'GET':
results = db['EPGD_test'].find()
json_results= []
for result in results:
json_results.append(result)
return toJson(json_results)
使用Flask Web框架时,如何称呼我的抓狂蜘蛛?
在你的Spider前面添加HTTP服务器并不是那么容易。有几种选择。
所有示例的目录结构应如下所示,我正在使用dirbot测试项目
> tree -L 1
├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py
这是在新过程中启动Scrapy的代码示例:
# server.py
import subprocess
from flask import Flask
app = Flask(__name__)
@app.route('/')
def hello_world():
"""
Run spider in another process and store items in file. Simply issue command:
> scrapy crawl dmoz -o "output.json"
wait for this command to finish, and read output.json to client.
"""
spider_name = "dmoz"
subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
with open("output.json") as items_file:
return items_file.read()
if __name__ == '__main__':
app.run(debug=True)
将其另存为server.py并访问localhost:5000,你应该可以看到被抓取的项目。
2. Twisted-Klein + Scrapy
其他更好的方法是使用一些现有项目,该项目将Twisted与Werkzeug集成在一起,并显示类似于Flask的API,例如Twisted-Klein。Twisted-Klein允许你在与Web服务器相同的过程中异步运行蜘蛛。最好不要在每个请求上都阻塞,它使你可以简单地从HTTP路由请求处理程序返回Scrapy / Twisted延迟。
在代码片段将Twisted-Klein与Scrapy集成之后,请注意,你需要创建自己的CrawlerRunner基类,以便Crawler可以收集项目并将其返回给调用方。此选项稍微高级一些,你正在以与Python服务器相同的方式运行Scrapy Spider,项目不是存储在文件中而是存储在内存中(因此,与前面的示例一样,没有磁盘写/读操作)。最重要的是,它是异步的,并且都在一个Twisted反应器中运行。
# server.py
import json
from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from dirbot.spiders.dmoz import DmozSpider
class MyCrawlerRunner(CrawlerRunner):
"""
Crawler object that collects items and returns output after finishing crawl.
"""
def crawl(self, crawler_or_spidercls, *args, **kwargs):
# keep all items scraped
self.items = []
# create crawler (Same as in base CrawlerProcess)
crawler = self.create_crawler(crawler_or_spidercls)
# handle each item scraped
crawler.signals.connect(self.item_scraped, signals.item_scraped)
# create Twisted.Deferred launching crawl
dfd = self._crawl(crawler, *args, **kwargs)
# add callback - when crawl is done cal return_items
dfd.addCallback(self.return_items)
return dfd
def item_scraped(self, item, response, spider):
self.items.append(item)
def return_items(self, result):
return self.items
def return_spider_output(output):
"""
:param output: items scraped by CrawlerRunner
:return: json with list of items
"""
# this just turns items into dictionaries
# you may want to use Scrapy JSON serializer here
return json.dumps([dict(item) for item in output])
@route("/")
def schedule(request):
runner = MyCrawlerRunner()
spider = DmozSpider()
deferred = runner.crawl(spider)
deferred.addCallback(return_spider_output)
return deferred
run("localhost", 8080)
将上面的内容保存在server.py文件中,并将其放在你的Scrapy项目目录中,现在打开localhost:8080,它将启动dmoz spider并将作为json抓取的项目返回到浏览器。
3. ScrapyRT
当你尝试在蜘蛛程序前面添加HTTP应用程序时,会出现一些问题。例如,你有时需要处理蜘蛛日志(在某些情况下可能需要它们),需要以某种方式处理蜘蛛异常等。有些项目可让你以更简单的方式向蜘蛛添加HTTP API,例如ScrapyRT。这是一个将HTTP服务器添加到你的Scrapy Spiders并为你处理所有问题(例如,处理日志记录,处理Spider错误等)的应用程序。
因此,在安装ScrapyRT之后,你只需要执行以下操作:
> scrapyrt
问题内容: 我是swift的新手,任何人都可以帮助我快速集成PayU Money…。我正在使用此SDK:https : //github.com/payu-intrepos/Documentations/wiki/8.1/NEW-iOS-Seamless -SDK集成 问题答案: 这个答案来自PayU文档本身,我在这里回答的原因只是因为花了我几个小时才能实现他们的文档。 嗨,我可以为您提供NON无
Spring Boot如何整合MyBatis? 如果在 Service 层有一些业务逻辑需要对 Mapper 层返回的数据进行进一步处理,有没有一些最佳实践来确保代码的可读性和可维护性?
情景是这样,我的项目,一两个、几个数据操作速度还是可以的。但操作一多起来,速度就变得超慢的。我严重怀疑是我用错了,想请问大佬们都是怎么使用的?
网上有很多方法可以将Cucumber与Spring Boot结合起来。但我也找不到如何使用Mockito。如果我使用Cucumber runner,并使用ContextConfiguration和SpringBootTest对steps文件进行注释,那么容器将注入自动连接的依赖项及其所有特性。问题是,用Mock、MockBean和injectmock注释的依赖项不起作用。有人知道它为什么不起作用,
我如何让骆驼使用这个数据源bean?如何告诉Camel使用新建的SQL会话工厂,而不是尝试从配置文件构建? 在github中创建了示例appl,它使用内存中的db(h2) sample-app 获取NPE使用者[MyBatis://GetClaimInfo?StatementType=Selectone]轮询终结点失败:终结点[MyBatis://GetClaimInfo?StatementTyp
如题,如何在Java项目中调用ChatGPT的接口,我看了官方文档看不懂。