google-group-crawler is a Bash-4 script to download all (original) messages from a Google group archive. Private groups require a cookie string/file. Groups with adult content are not supported yet.
The script requires bash-4, sort, curl, sed and awk.
Make the script executable with chmod 755 and put it in your path (e.g., /usr/local/bin/).
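For example, assuming you have cloned the repository into the current directory, installation boils down to something like:

chmod 755 crawler.sh
cp crawler.sh /usr/local/bin/   # may require sudo, or pick another directory in your PATH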
The script may not work in a Windows environment, as reported in https://github.com/icy/google-group-crawler/issues/26.
For a private group, please prepare your cookies file.
# export _CURL_OPTIONS="-v" # use curl options to provide, e.g., cookies
# export _HOOK_FILE="/some/path" # provide a hook file, see in #the-hook
# export _ORG="your.company" # required, if you are using Gsuite
export _GROUP="mygroup" # specify your group
./crawler.sh -sh # first run for testing
./crawler.sh -sh > curl.sh # save your script
bash curl.sh # downloading mbox files
You can execute the curl.sh script multiple times, as curl will quickly skip any fully downloaded files.
After you have an archive from the first run, you only need to add the latest messages as shown in the feed. You can do that with the -rss option and the additional _RSS_NUM environment variable:
export _RSS_NUM=50 # (optional. See Tips & Tricks.)
./crawler.sh -rss > update.sh # using rss feed for updating
bash update.sh # download the latest posts
It's useful to run this update frequently to keep your local archive current.
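For example, a cron entry along these lines (a sketch only; the path and group name are placeholders to adjust) runs the RSS update every night at 03:00:

# m h dom mon dow  command
0 3 * * * cd /path/to/google-group-crawler && export _GROUP="mygroup" && ./crawler.sh -rss > update.sh && bash update.sh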
To download messages from a private group or a group hosted by your organization, you need to provide some cookie information to the script. In the past, the script used wget and the Netscape cookie file format; now it uses curl with a cookie string and a configuration file.
Open Firefox, press F12 to enable Debug mode and select the Network tab from the Debug console of Firefox. (You may find a similar way for your favorite browser.)
Log in to your testing Google account and access your group, for example https://groups.google.com/forum/?_escaped_fragment_=categories/google-group-crawler-public (replace google-group-crawler-public with your group name). Make sure you can read some contents with your own group URI.
Now, from the Network tab in the Debug console, select the address and select Copy -> Copy Request Headers. You will get a lot of things in the result; paste them into your text editor and keep only the Cookie part.
Now prepare a file curl-options.txt as below:
user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"
header = "Cookie: <snip>"
Of course, replace the <snip> part with your own cookie string. See man curl for more details of the file format.
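If you saved the copied request headers into a file, say request-headers.txt (a file name used here only for illustration), a small sketch like the following can generate curl-options.txt for you:

# assumption: request-headers.txt holds the headers copied from the browser
{
  echo 'user-agent = "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0"'
  printf 'header = "%s"\n' "$(grep -i '^Cookie:' request-headers.txt | head -1)"
} > curl-options.txt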
Specify this options file via _CURL_OPTIONS:
export _CURL_OPTIONS="-K /path/to/curl-options.txt"
Now every hidden group can be downloaded :)
If you want to execute a hook command after an mbox file is downloaded, you can do as below.
Prepare a Bash script file that contains a definition of the __curl_hook function. The first argument specifies the output filename, and the second argument specifies a URL. For example, here is a simple hook:
# $1: output file
# $2: url (https://groups.google.com/forum/message/raw?msg=foobar/topicID/msgID)
__curl_hook() {
  if [[ "$(stat -c %b "$1")" == 0 ]]; then
    echo >&2 ":: Warning: empty output '$1'"
  fi
}
In this example, the hook will check if the output file is empty and, if so, send a warning to the standard error device.
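As another illustration (a sketch, not part of the project), the hook can use both of its arguments, for example to keep a simple log of finished downloads; the log path here is only an example:

# $1: output file
# $2: url
__curl_hook() {
  printf '%s %s\n' "$2" "$1" >> "$HOME/ggc-download.log"   # example log path
}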
Set the environment variable _HOOK_FILE to the path of your file. For example:
export _GROUP=archlinuxvn
export _HOOK_FILE=$HOME/bin/curl.hook.sh
Now the hook file will be loaded in the future output of the commands crawler.sh -sh or crawler.sh -rss.
The downloaded messages are found under $_GROUP/mbox/*.
They are in RFC 822 format (possibly with obfuscated email addresses) and they can easily be converted to mbox format before being imported into your email client (Thunderbird, claws-mail, etc.).
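For example, here is a minimal sketch (not part of the script) that concatenates the raw messages into one mbox file by prepending the From separator line that mbox readers expect; a dedicated tool will handle From-line escaping inside message bodies more carefully:

: > "$_GROUP.mbox"
for f in "$_GROUP"/mbox/m.*; do
  printf 'From - %s\n' "$(date -u)" >> "$_GROUP.mbox"
  cat "$f" >> "$_GROUP.mbox"
  printf '\n' >> "$_GROUP.mbox"
done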
You can also use the mhonarc utility to convert the downloaded messages to HTML files.
Sometimes you may need to rescan or redownload all messages. This can be done by removing all temporary files:
rm -fv $_GROUP/threads/t.* # this is a must
rm -fv $_GROUP/msgs/m.* # see also Tips & Tricks
or you can use the _FORCE option:
_FORCE="true" ./crawler.sh -sh
Another option is to delete all files under the $_GROUP/ directory. As usual, remember to back up before you delete anything.
See also https://github.com/icy/google-group-crawler/issues/30.
parallel support: @Pikrass has a script to download messages in parallel. It's discussed in the ticket https://github.com/icy/google-group-crawler/issues/32. The script: https://gist.github.com/Pikrass/f8462ff8a9af18f97f08d2a90533af31
raw access denied: @alexivkin mentioned he could use the print function to work around the issue. See it here: https://github.com/icy/google-group-crawler/issues/29#issuecomment-468810786
This work is released under the terms of the MIT license.
This script is written by Anh K. Huynh.
He wrote this script because he couldn't solve the problem by using nodejs, phantomjs, or Watir.
New web technology just makes life harder, doesn't it?
Please skip this section unless you really know how to work with Bash and shells.
If you clean your files (as below), you may notice that it will be very slow when re-downloading all files. You may consider using the -rss option instead. This option will fetch data from an RSS feed.
It's recommended to use the -rss option for daily updates. By default, the number of items is 50. You can change it with the _RSS_NUM variable. However, don't use a very big number, because Google will ignore it.
Because the Topics list is a FIFO list, you only need to remove the last file. The script will re-download the last item, and if there is a new page, that page will be fetched.
ls $_GROUP/msgs/m.* \
  | sed -e 's#\.[0-9]\+$##g' \
  | sort -u \
  | while read f; do
      last_item="$f.$( \
        ls $f.* \
        | sed -e 's#^.*\.\([0-9]\+\)#\1#g' \
        | sort -n \
        | tail -1 \
      )";
      echo $last_item;
    done
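The loop above only prints the name of the last file for each topic. Once you have reviewed that list, one way (a sketch, at your own risk) to actually remove those files, so that they are re-fetched on the next run, is to replace the final line of the loop:

done | xargs -r rm -fv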
The list of threads is a LIFO list. If you want to rescan your list, you will need to delete all files under $_D_OUTPUT/threads/.
You can set the timestamps of the mbox output files, as below:
ls $_GROUP/mbox/m.* \
  | while read FILE; do
      date="$( \
        grep '^Date:' "$FILE" \
        | head -1 \
        | sed -e 's#^Date: ##g' \
      )";
      touch -d "$date" "$FILE";
    done
This will be very useful, for example, when you want to use the mbox files with mhonarc.
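For instance (a sketch only; consult man mhonarc for the options your installation supports), the single mbox file produced by the conversion sketch earlier could be turned into an HTML archive like this:

mhonarc -outdir "$_GROUP-html" "$_GROUP.mbox"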