
ensembl-hive

License: Apache-2.0
Development language: Perl
Category: Databases
Software type: Open source
Region: Unknown
Submitted by: 能钟展
Operating system: Cross-platform
Target audience: Unknown
Software overview

eHive

eHive is a system for running computation pipelines on distributed computing resources - clusters, farms or grids.

The name comes from the way pipelines are processed by a swarm of autonomous agents.

Available documentation

The main entry point is available online in the user manual, from where it can be downloaded for offline access.

eHive in a nutshell

Blackboard, Jobs and Workers

In the centre of each running pipeline is a database that acts as a blackboard with individual tasks to be run. These tasks (we call them Jobs) are claimed and processed by "Worker bees" or just Workers - autonomous processes that are continuously running on the compute farm and connect to the pipeline database to report about the progress of Jobs or claim some more. When a Worker discovers that its predefined time is up or that there are no more Jobs to do, it claims no more Jobs and exits the compute farm, freeing the resources.

Beekeeper

A separate Beekeeper process makes sure there are always enough Workers on the farm. It regularly checks the states of both the blackboard and the farm and submits more Workers when needed. There is no direct communication between Beekeeper and Workers, which makes the system rather fault-tolerant, as crashing of any of the agents for whatever reason doesn't stop the rest of the system from running.

Analyses

Jobs that share the same code, common parameters and resource requirements are typically grouped into Analyses, and generally an Analysis can be viewed as a "base class" for the Jobs that belong to it. However, in some sense an Analysis also acts as a "container" for them.

An Analysis is implemented as a Runnable file, which is a Perl, Python or Java module conforming to a special interface. eHive provides some basic Runnables, especially one that allows running arbitrary commands (programs and scripts written in other languages).
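
For illustration, here is a minimal Perl Runnable sketch. The three-method lifecycle (fetch_input / run / write_output), the Bio::EnsEMBL::Hive::Process base class and the param() / dataflow_output_id() calls follow the interface described in the user manual; the package name and parameters are made up:

package MyPipeline::Greet;    # hypothetical package name

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::Process');

sub fetch_input {             # read and validate the Job's parameters
    my $self = shift;
    $self->param_required('name');    # fail early if 'name' was not provided
}

sub run {                     # the actual computation
    my $self = shift;
    $self->param('greeting', 'Hello, ' . $self->param('name') . '!');
}

sub write_output {            # send the result out on dataflow branch #1
    my $self = shift;
    $self->dataflow_output_id( { 'greeting' => $self->param('greeting') }, 1 );
}

1;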

PipeConfig file defines Analyses and dependency rules of the pipeline

eHive pipeline databases are molded according to PipeConfig files, which are Perl modules conforming to a special interface. A PipeConfig file defines the structure of the pipeline, which is a graph whose nodes are Analyses (with their code, parameters and resource requirements) and whose edges are various dependency rules (a minimal PipeConfig sketch follows this list):

  • Dataflow rules define how data that flows out of an Analysis can be used to trigger creation of Jobs in other Analyses
  • Control rules define dependencies between Analyses as Jobs' containers ("Jobs of Analysis Y can only start when all Jobs of Analysis X are done")
  • Semaphore rules define dependencies between individual Jobs on a more fine-grained level
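
A minimal PipeConfig sketch of a fan-out/merge pipeline. The analysis and module names are made up; the HiveGeneric_conf base class, the pipeline_analyses() method and the -flow_into / -wait_for keys are as documented in the user manual:

package MyPipeline::PipeConfig::Demo_conf;    # hypothetical package name

use strict;
use warnings;

use base ('Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf');

sub pipeline_analyses {
    my ($self) = @_;
    return [
        {   -logic_name => 'split_input',
            -module     => 'MyPipeline::SplitInput',     # hypothetical Runnable
            -input_ids  => [ { 'source' => 'data.txt' } ],
            -flow_into  => { 2 => 'process_chunk' },     # dataflow rule: each event on branch #2 seeds a process_chunk Job
        },
        {   -logic_name => 'process_chunk',
            -module     => 'MyPipeline::ProcessChunk',   # hypothetical Runnable
        },
        {   -logic_name => 'merge_results',
            -module     => 'MyPipeline::MergeResults',   # hypothetical Runnable
            -input_ids  => [ {} ],
            -wait_for   => [ 'process_chunk' ],          # control rule: start only after all process_chunk Jobs are done
        },
    ];
}

1;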

There are also other parameters of Analyses that control, for example:

  • how many Workers can simultaneously work on a given Analysis,
  • how many times a Job should be tried until it is considered failed,
  • what should be automatically done with a Job if it needs more memory/time, etc.
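
In a PipeConfig, these knobs are extra keys in the analysis definition. A hedged fragment (key names as in the user manual, values invented):

{   -logic_name        => 'process_chunk',
    -module            => 'MyPipeline::ProcessChunk',  # hypothetical Runnable
    -analysis_capacity => 50,        # at most 50 Workers on this Analysis at once
    -max_retry_count   => 3,         # a Job is tried up to 3 times before being marked FAILED
    -rc_name           => '4Gb_job', # resource class, defined in the resource_classes() section
},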

Grid scheduler and Meadows

eHive has a generic interface named Meadow that describes how to interact with an underlying grid scheduler (submit jobs, query a job's status, etc.). eHive is compatible with IBM Platform LSF, Sun Grid Engine (now known as Oracle Grid Engine), HTCondor, PBS Pro, Docker Swarm and maybe others. Read more about this in the user manual.
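
As a usage note (hedged: the -meadow_type option and the LOCAL backend are taken from the beekeeper.pl documentation), the scheduler backend can be selected explicitly when starting the Beekeeper, for example to run Workers on the local machine:

beekeeper.pl -url $URL -meadow_type LOCAL -loop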

Docker image

We have a Docker image available on DockerHub. It can be used to showcase eHive scripts (init_pipeline.pl, beekeeper.pl, runWorker.pl) in a container.

Open a session in a new container (will run bash)

docker run -it ensemblorg/ensembl-hive

Initialize and run a pipeline

docker run -it ensemblorg/ensembl-hive init_pipeline.pl Bio::EnsEMBL::Hive::Examples::LongMult::PipeConfig::LongMult_conf -pipeline_url $URL
docker run -it ensemblorg/ensembl-hive beekeeper.pl -url $URL -loop -sleep 0.2
docker run -it ensemblorg/ensembl-hive runWorker.pl -url $URL
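
These examples assume $URL holds the pipeline database URL in eHive's scheme://user:password@host:port/dbname notation; a purely illustrative value (credentials, host and database name made up):

export URL='mysql://user:password@mysql-host:3306/long_mult'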

Docker Swarm

Once packaged into Docker images, a pipeline can actually be run under the Docker Swarm orchestrator, and thus on any cloud infrastructure that supports it (e.g. Amazon Web Services, Microsoft Azure).

Read more about this on the user manual.

Contact us (mailing list)

eHive was originally conceived and used within the EnsEMBL Compara group for running Comparative Genomics pipelines, but it has since been split out as a separate software tool and is used in many projects both on the Genome Campus, Cambridge and outside. There is an eHive users' mailing list for questions, suggestions, discussions and announcements.

To subscribe, please visit http://listserver.ebi.ac.uk/mailman/listinfo/ehive-users

  • 本文向大家介绍Hive与关系型数据库的关系?相关面试题,主要包含被问及Hive与关系型数据库的关系?时的应答技巧和注意事项,需要的朋友参考一下 没有关系,hive是数据仓库,不能和数据库一样进行实时的CURD操作。 是一次写入多次读取的操作,可以看成是ETL工具。