Configuring Disco: An Erlang-Based Map-Reduce Framework

房泉

2023-12-01

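Disco is a lightweight Map-Reduce system. Its core is written in Erlang, a language well suited to concurrency, and its external programming interface is Python, which is easy to program in. Disco can be deployed on clusters and multi-core machines, as well as on Amazon EC2. This article describes how to set up Disco on Ubuntu.

Disco can be downloaded from its official website (http://discoproject.org/). The latest version at the time of writing is 0.2 (released 2009-04-07). I configured Disco on Ubuntu 8.10 and 9.04; the procedure is as follows: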

1. Install the required packages

Disco depends on the following packages, which must be installed on the system first:

SSH daemon and client, website http://www.openssh.com/, install with sudo apt-get install ssh
Erlang/OTP R12B or newer, website http://www.erlang.org/, install with sudo apt-get install erlang
Lighttpd 1.4.17 or newer, website http://lighttpd.net/, install with sudo apt-get install lighttpd
Python 2.4 or newer, website http://www.python.org/, install with sudo apt-get install python
Python setuptools, website http://pypi.python.org/pypi/setuptools, install with sudo apt-get install python-setuptools
cJSON module for Python, website http://pypi.python.org/pypi/python-cjson, install with sudo apt-get install python-cjson
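
Before building, it can help to confirm that the command-line tools from the list above are actually on the PATH. The following is a minimal sketch, not part of Disco: it assumes a modern Python 3 interpreter for the helper itself (independent of the Python 2 that Disco 0.2 targets), and the executable names in REQUIRED_COMMANDS are assumptions derived from the package names.

```python
import shutil

# Executable names are assumptions based on the packages listed above.
# Some daemons (for example lighttpd) may live in /usr/sbin, which is not
# always on a regular user's PATH, so "missing" can be a false alarm.
REQUIRED_COMMANDS = ["ssh", "erl", "lighttpd", "python"]


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if which(cmd) is None]
```

Calling missing_commands(REQUIRED_COMMANDS) returns an empty list when every tool was found; any name it returns points at a package that may still need to be installed.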

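2. Build and install Disco

Building Disco is simple: unpack the Disco archive and run make in the resulting directory. To install to a specific location, run make install DESTDIR=***, where *** stands for the target path. To run Disco on a cluster, Disco must be set up on every machine in the cluster.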

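3. Configure the Disco runtime environment

We first set up Disco on a single machine.

The disco user must be able to log in over SSH without a password.

If the node does not yet have a valid SSH key, create one with the following command: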

ssh-keygen -N '' -f ~/.ssh/id_dsa

On a single machine, or on a cluster with shared storage, authorize the key with the following command:

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Once this is done, test it with the following command; if no password is requested, the setup works:

ssh localhost erl
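
The same check can be scripted so that it fails instead of waiting at a password prompt. This is a minimal sketch under two assumptions: it uses the standard OpenSSH option BatchMode=yes, and it runs the harmless remote command true rather than erl, since erl would open an interactive Erlang shell. The function names are made up for illustration.

```python
import subprocess


def ssh_check_command(host="localhost", remote_cmd="true"):
    """Build an ssh command line that never prompts for a password."""
    # BatchMode=yes makes ssh exit with an error instead of asking for input.
    return ["ssh", "-o", "BatchMode=yes", host, remote_cmd]


def passwordless_ssh_works(host="localhost", run=subprocess.call):
    """Return True if ssh to the host succeeds without any interaction."""
    return run(ssh_check_command(host)) == 0
```

On a cluster, calling passwordless_ssh_works once per node name gives a quick overview of which machines still need the public key in authorized_keys.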

We refer to the Disco installation path as $DISCOHOME$.

Next, add the paths $DISCOHOME$/pydisco, $DISCOHOME$/pydisco/disco, $DISCOHOME$/util, and $DISCOHOME$/node/disconode to PYTHONPATH.
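
As an illustration of what the resulting PYTHONPATH value looks like, the sketch below joins the four sub-directories named above onto an installation root. The helper name disco_pythonpath and the example root /opt/disco are hypothetical; only the sub-directory names come from the text.

```python
import os

# Sub-directories named in the text, relative to the Disco install root.
DISCO_SUBDIRS = ["pydisco", "pydisco/disco", "util", "node/disconode"]


def disco_pythonpath(disco_home, existing=""):
    """Return a PYTHONPATH string with the Disco directories placed first."""
    paths = [os.path.join(disco_home, *sub.split("/")) for sub in DISCO_SUBDIRS]
    if existing:
        # Keep whatever was already on PYTHONPATH after the Disco entries.
        paths.append(existing)
    return os.pathsep.join(paths)
```

For example, print(disco_pythonpath("/opt/disco", os.environ.get("PYTHONPATH", ""))) produces a value that can be exported from the shell profile of the disco user.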

Start Disco.

Start the Disco master and node processes on the single machine with the following two commands:

conf/start-master

conf/start-node

Note that it is best to kill any previously started beam and lighttpd processes before running these two commands:

sudo killall -9 beam

sudo killall -9 lighttpd

Open http://localhost:7000 in a browser, add the available nodes on the configure page (for example, set Nodes to localhost and Max Workers to 2), and save.
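
If no browser is at hand, a short script can at least confirm that something answers on the master's port. This is only a sketch that assumes Python 3 for the helper; master_is_up is a hypothetical name, and the check says nothing about whether the nodes have been configured.

```python
import http.client
import urllib.error
import urllib.request


def master_is_up(url, timeout=3.0):
    """Return True if an HTTP request to the given URL succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.getcode() < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeouts, HTTP errors and malformed URLs all
        # count as "not reachable" for the purpose of this check.
        return False
```

With the setup described above, master_is_up("http://localhost:7000") should return True once conf/start-master has been run.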


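With the steps above, the Disco setup is essentially complete. We can now write a program to test Disco, for example the word-count example wordcount.py from the website, whose code is as follows: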

from disco.core import Disco, result_iterator
import disco
import sys

def fun_map(e, params):
    # Emit a (word, 1) pair for every word on the input line.
    return [(w, 1) for w in e.split()]

def fun_reduce(iter, out, params):
    # Sum the counts for each word, then write the totals to the output.
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.iteritems():
        out.add(w, f)

inputAddress = ["http://discoproject.org/chekhov.txt"]

# sys.argv[1] is the address of the Disco master, e.g. http://localhost:7000.
results = disco.job(sys.argv[1], name = "wordcount",
                    input = inputAddress,
                    map = fun_map,
                    reduce = fun_reduce,
                    nr_reduces = 1,
                    sort = False)

for word, frequency in result_iterator(results):
    print word, frequency

Then run the program with python wordcount.py http://localhost:7000; its progress can be monitored at http://localhost:7000.
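
To see what the map and reduce functions in wordcount.py compute without a running cluster, the same logic can be applied to a few lines in memory. This is a local sketch only: fun_map is copied from the example, while local_wordcount is a hypothetical stand-in for the work Disco distributes, with a plain dictionary replacing the out object.

```python
def fun_map(e, params):
    # Same map function as in wordcount.py: one (word, 1) pair per word.
    return [(w, 1) for w in e.split()]


def local_wordcount(lines):
    """Run the map step on every line, then sum the counts per word."""
    pairs = []
    for line in lines:
        pairs.extend(fun_map(line, None))
    counts = {}
    for w, f in pairs:
        # Mirrors the accumulation loop in fun_reduce.
        counts[w] = counts.get(w, 0) + int(f)
    return counts
```

Feeding it two short lines such as "a b a" and "b c" shows how the per-line pairs are merged into a single total per word, which is what the single reduce (nr_reduces = 1) does for the whole input.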

For further problems, see http://discoproject.org/doc/start/troubleshoot.html
