This crawler automates the following step:
# upload pdf to googledrive, store data and notify via email
python script/spider.py -c config/prod.cfg -u googledrive -s firebase -n gmail
# download all format
python script/spider.py --config config/prod.cfg --all
# download only one format: pdf|epub|mobi
python script/spider.py --config config/prod.cfg --type pdf
# download also additional material: source code (if exists) and book cover
python script/spider.py --config config/prod.cfg -t pdf --extras
# equivalent (default is pdf)
python script/spider.py -c config/prod.cfg -e
# download and then upload to Google Drive (given the download url anyone can download it)
python script/spider.py -c config/prod.cfg -t epub --upload googledrive
python script/spider.py --config config/prod.cfg --all --extras --upload googledrive
# download and then upload to OneDrive (given the download url anyone can download it)
python script/spider.py -c config/prod.cfg -t epub --upload onedrive
python script/spider.py --config config/prod.cfg --all --extras --upload onedrive
# download and notify: gmail|ifttt|join|pushover
python script/spider.py -c config/prod.cfg --notify gmail
# only claim book (no downloads):
python script/spider.py -c config/prod.cfg --notify gmail --claimOnly
Before you start you should
python --version
git clone https://github.com/niqdev/packtpub-crawler.git
pip install -r requirements.txt
(see also virtualenv)
cp config/prod_example.cfg config/prod.cfg
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
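The keys in the config file are prefixed with their section name (for example `credential.email` inside `[credential]`). As a minimal sketch of how such a file can be read, assuming Python 3's `configparser` rather than whatever loader the crawler itself uses, the hypothetical helper `read_credentials` below parses the fragment above. Interpolation is disabled so that a password containing `%` is read literally.

```python
import configparser

# Same layout as the [credential] section shown above (placeholder values).
SAMPLE_CFG = """
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
"""


def read_credentials(cfg_text):
    """Return (email, password) from the [credential] section.

    Hypothetical helper for illustration; not part of the project.
    """
    # interpolation=None keeps '%' characters in passwords untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    section = parser["credential"]
    return section["credential.email"], section["credential.password"]


email, password = read_credentials(SAMPLE_CFG)
print(email, password)
```

To read the real file instead of the inline sample, `parser.read("config/prod.cfg")` can replace `read_string`.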
Now you should be able to claim and download your first eBook
python script/spider.py --config config/prod.cfg
From the documentation, Google Drive API requires OAuth2.0 for authentication, so to upload files you should:
config/client_secrets.json
[googledrive]
...
googledrive.client_secrets=config/client_secrets.json
googledrive.gmail=GOOGLE_DRIVE@gmail.com
Now you should be able to upload your eBook to Google Drive
python script/spider.py --config config/prod.cfg --upload googledrive
Only the first time, you will be prompted to log in using a browser with JavaScript enabled (text-based browsers do not work) in order to generate config/auth_token.json.
You should also copy the FOLDER_ID into the config; otherwise a new folder with the same name is created on every run.
[googledrive]
...
googledrive.default_folder=packtpub
googledrive.upload_folder=FOLDER_ID
Documentation: OAuth, Quickstart, example and permissions
From the documentation, OneDrive API requires OAuth2.0 for authentication, so to upload files you should:
[onedrive]
...
onedrive.client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
onedrive.client_secret=XxXxXxXxXxXxXxXxXxXxXxX
Now you should be able to upload your eBook to OneDrive
python script/spider.py --config config/prod.cfg --upload onedrive
Only the first time, you will be prompted to log in using a browser with JavaScript enabled (text-based browsers do not work) in order to generate config/session.onedrive.pickle.
[onedrive]
...
onedrive.folder=packtpub
Documentation: Registration, Python API
To upload your eBook via scp to a remote server, update the configs
[scp]
scp.host=SCP_HOST
scp.user=SCP_USER
scp.password=SCP_PASSWORD
scp.path=SCP_UPLOAD_PATH
Now you should be able to upload your eBook
python script/spider.py --config config/prod.cfg --upload scp
Note:
scp.path must already exist on the remote server
--upload scp is incompatible with --store and --notify
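The incompatibility between `--upload scp` and `--store`/`--notify` can be checked at argument-parsing time. The following is only a sketch with `argparse`, not the crawler's actual parser: `build_parser` and `parse_args` are hypothetical names, only a subset of the flags is modeled, and the allowed values are assumed from the commands shown in this README.

```python
import argparse


def build_parser():
    """Model a subset of the spider.py flags (assumed from the README)."""
    parser = argparse.ArgumentParser(prog="spider.py")
    parser.add_argument("-c", "--config", required=True)
    parser.add_argument("-u", "--upload",
                        choices=["googledrive", "onedrive", "scp"])
    parser.add_argument("-s", "--store", choices=["firebase"])
    parser.add_argument("-n", "--notify",
                        choices=["gmail", "ifttt", "join", "pushover"])
    return parser


def parse_args(argv):
    """Parse argv and reject the combination the note above rules out."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.upload == "scp" and (args.store or args.notify):
        # parser.error prints a usage message and exits with status 2.
        parser.error("--upload scp is incompatible with --store and --notify")
    return args


args = parse_args(["-c", "config/prod.cfg", "--upload", "scp"])
print(args.upload)
```

Failing early like this avoids downloading and uploading a book only to discover afterwards that the store or notify step cannot run.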
Create a new Firebase project, copy the database secret from your settings
https://console.firebase.google.com/project/PROJECT_NAME/settings/database
and update the configs
[firebase]
firebase.database_secret=DATABASE_SECRET
firebase.url=https://PROJECT_NAME.firebaseio.com
Now you should be able to store your eBook details on Firebase
python script/spider.py --config config/prod.cfg --upload googledrive --store firebase
To send a notification via email using Gmail you should:
[gmail]
...
gmail.username=EMAIL_USERNAME@gmail.com
gmail.password=EMAIL_PASSWORD
gmail.from=FROM_EMAIL@gmail.com
gmail.to=TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com
Now you should be able to notify your accounts
python script/spider.py --config config/prod.cfg --notify gmail
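Since `gmail.to` holds a comma-separated list, the recipients have to be split before a message is built. The sketch below shows one way to do that with the standard library's `email.message.EmailMessage`; the helper name `build_notification`, the subject prefix and the message body are assumptions made for illustration, and nothing is actually sent.

```python
from email.message import EmailMessage


def build_notification(gmail_from, gmail_to, title, download_url):
    """Build (but do not send) a notification email.

    gmail_to is the raw comma-separated value from the [gmail] section.
    Hypothetical helper; the crawler's real message format may differ.
    """
    # Split on commas and drop empty entries and stray whitespace.
    recipients = [addr.strip() for addr in gmail_to.split(",") if addr.strip()]
    msg = EmailMessage()
    msg["From"] = gmail_from
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "[packtpub-crawler] " + title
    msg.set_content("Download: " + download_url)
    return msg, recipients


msg, recipients = build_notification(
    "FROM_EMAIL@gmail.com",
    "TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com",
    "Example Book",
    "https://example.com/book.pdf",
)
print(recipients)
```

Delivery (for example through `smtplib` with `gmail.username` and `gmail.password`) is left out so the sketch runs offline.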
[ifttt]
ifttt.event_name=packtpub-crawler
ifttt.key=IFTTT_MAKER_KEY
Now you should be able to trigger the applet
python script/spider.py --config config/prod.cfg --notify ifttt
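For orientation, an IFTTT Webhooks (Maker) trigger is an HTTP POST to a URL built from the event name and the key, with up to three values in a JSON body. The sketch below only constructs such a request with `urllib.request` and does not send it; `build_ifttt_request` is a hypothetical helper, and which book field goes into which value is an assumption, not the crawler's documented mapping.

```python
import json
import urllib.request


def build_ifttt_request(event_name, key, value1, value2="", value3=""):
    """Build (but do not send) a POST request for an IFTTT webhook trigger."""
    url = "https://maker.ifttt.com/trigger/{}/with/key/{}".format(event_name, key)
    # The Webhooks service accepts up to three free-form values.
    payload = json.dumps(
        {"value1": value1, "value2": value2, "value3": value3}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values taken from the [ifttt] section above;
# using the book title as value1 is an assumption.
req = build_ifttt_request("packtpub-crawler", "IFTTT_MAKER_KEY", "Example Book")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would fire the applet, provided the key and event name are real.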
Value mappings:
[join]
join.device_ids=DEVICE_IDS_COMMA_SEPARATED_OR_GROUP_NAME
join.api_key=API_KEY
Now you should be able to trigger the event
python script/spider.py --config config/prod.cfg --notify join
[pushover]
pushover.user_key=PUSHOVER_USER_KEY
pushover.api_key=PUSHOVER_API_KEY
Create a new branch
git checkout -b heroku-scheduler
Update the .gitignore and commit your changes
# remove
config/prod.cfg
config/client_secrets.json
config/auth_token.json
# add
dev/
config/dev.cfg
config/prod_example.cfg
Create, config and deploy the scheduler
heroku login
# create a new app
heroku create APP_NAME --region eu
# or if you already have an existing app
heroku git:remote -a APP_NAME
# deploy your app
git push -u heroku heroku-scheduler:master
heroku ps:scale clock=1
# useful commands
heroku ps
heroku logs --ps clock.1
heroku logs --tail
heroku run bash
Update script/scheduler.py with your own preferences.
More info about Heroku Scheduler, Clock Processes, Add-on and APScheduler
Build your image
docker build -t niqdev/packtpub-crawler:2.4.0 .
Run manually
docker run \
--rm \
--name my-packtpub-crawler \
niqdev/packtpub-crawler:2.4.0 \
python script/spider.py --config config/prod.cfg
Run scheduled crawler in background
docker run \
--detach \
--name my-packtpub-crawler \
niqdev/packtpub-crawler:2.4.0
# useful commands
docker exec -i -t my-packtpub-crawler bash
docker logs -f my-packtpub-crawler
Alternatively, you can pull this fork from Docker Hub
docker pull kuchy/packtpub-crawler
Add this to your crontab to run the job daily at 9 AM:
crontab -e
00 09 * * * cd PATH_TO_PROJECT/packtpub-crawler && /usr/bin/python script/spider.py --config config/prod.cfg >> /tmp/packtpub.log 2>&1
Create two files in /etc/systemd/system, packtpub_crawler.service and packtpub_crawler.timer:
[Unit]
Description=run packtpub-crawler
[Service]
User=USER_THAT_SHOULD_RUN_THE_SCRIPT
ExecStart=/usr/bin/python2.7 PATH_TO_PROJECT/packtpub-crawler/script/spider.py -c config/prod.cfg
[Install]
WantedBy=multi-user.target
[Unit]
Description=Runs packtpub-crawler every day at 7
[Timer]
OnBootSec=10min
OnActiveSec=1s
OnCalendar=*-*-* 07:00:00
Unit=packtpub_crawler.service
Persistent=true
[Install]
WantedBy=multi-user.target
Enable the script with sudo systemctl enable packtpub_crawler.timer.
You can test the service with sudo systemctl start packtpub_crawler.timer and see the output with sudo journalctl -u packtpub_crawler.service -f.
The script also downloads the free eBooks from the weekly Packtpub newsletter. The URL is generated by a Google Apps Script which parses all the mails. You can get the code here; if you want to see the actual script, clone the spreadsheet and go to Tools > Script editor...
To use your own source, modify in the config
url.bookFromNewsletter=https://goo.gl/kUciut
The URL should point to a file containing only the URL (no semicolons, HTML, JSON, etc).
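If you host your own source, it can help to check that the file really contains a single bare URL before pointing the crawler at it. The following sketch shows one possible check; `extract_single_url` is a hypothetical helper and the exact rules (http/https only, no whitespace inside, no trailing semicolon) are assumptions derived from the sentence above, not the crawler's own validation.

```python
from urllib.parse import urlparse


def extract_single_url(body):
    """Return the URL if body holds exactly one bare URL, else raise ValueError."""
    text = body.strip()  # tolerate a trailing newline
    if not text or any(ch.isspace() for ch in text):
        raise ValueError("expected exactly one URL and nothing else")
    if text.endswith(";"):
        raise ValueError("trailing semicolon is not allowed")
    parsed = urlparse(text)
    # HTML or JSON content has no http(s) scheme and is rejected here.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("not an absolute http(s) URL: %r" % text)
    return text


print(extract_single_url("https://example.com/free-book\n"))
```

The body passed in would be whatever text is served at the address configured in `url.bookFromNewsletter`.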
You can also clone the spreadsheet to use your own Gmail account. Subscribe to the newsletter (on the bottom of the page) and create a filter to tag your mails accordingly.
Install paramiko with sudo -H pip install paramiko --ignore-installed
Install missing dependencies as described here
# install pip + setuptools
curl https://bootstrap.pypa.io/get-pip.py | python -
# upgrade pip
pip install -U pip
# install virtualenv globally
sudo pip install virtualenv
# create virtualenv
virtualenv env
# activate virtualenv
source env/bin/activate
# verify virtualenv
which python
python --version
# deactivate virtualenv
deactivate
Run a simple static server with
node dev/server.js
and test the crawler with
python script/spider.py --dev --config config/dev.cfg --all
This project is just a Proof of Concept and not intended for any illegal usage. I'm not responsible for any damage or abuse, use it at your own risk.