
pipelines

An experimental programming language for data flow
License: MIT License
Development language: Python
Category: Developer Tools, Compilers
Software type: Open source
Region: Not specified
Submitted by: 池赞
Operating system: Cross-platform
Open source organization: Not specified
Target audience: Unknown

Software overview

Pipelines is a language and runtime for crafting massively parallel pipelines. Unlike other languages for defining data flow, the Pipeline language requires the implementation of components to be defined separately, in Python. This keeps the details of each implementation separate from the structure of the pipeline, while providing access to thousands of actively maintained libraries for machine learning, data analysis and processing. Skip to Getting Started to install the Pipeline compiler.

An example

As an introductory example, a simple pipeline for Fizz Buzz on even numbers could be written as follows -

from fizzbuzz import numbers
from fizzbuzz import even
from fizzbuzz import fizzbuzz
from fizzbuzz import printer

numbers
/> even 
|> fizzbuzz where (number=*, fizz="Fizz", buzz="Buzz")
|> printer

Meanwhile, the implementation of the components would be written in Python -

def numbers():
    for number in range(1, 100):
        yield number

def even(number):
    return number % 2 == 0

def fizzbuzz(number, fizz, buzz):
    if number % 15 == 0: return fizz + buzz
    elif number % 3 == 0: return fizz
    elif number % 5 == 0: return buzz
    else: return number

def printer(number):
    print(number)

Running the Pipeline document would safely execute each component of the pipeline in parallel and output the expected result.

The imports

Components are scripted in Python and linked into a pipeline using imports. The syntax for an import has 3 parts - (1) the path to the module, (2) the name of the function, and (3) the alias for the component. Here's an example -

from parser import parse_fasta as parse

That's really all there is to imports. Once a component is imported it can be referenced anywhere in the document with the alias.
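
To make this concrete, here is a minimal sketch of what a parser module behind that import might contain. The parse_fasta component and its behavior are assumptions for illustration, not part of the Pipelines project -

# parser.py - hypothetical module backing the import above
def parse_fasta(record):
    # Split a FASTA record like ">id\nACGT" into its id and sequence.
    header, _, sequence = record.partition("\n")
    return header.lstrip(">"), sequence.replace("\n", "")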

The stream

Every pipeline operates on a stream of data. The stream of data is created by a Python generator. The following is an example of a generator that produces a stream of numbers from 0 to 999.

def numbers():
    for number in range(0, 1000):
        yield number

Here's a generator that reads entries from a file -

def customers():
    for line in open("customers.csv", 'r'):
        yield line

The first component in a pipeline is always the generator. The generator runs in parallel with all other components, and each element it yields is passed through the other components.

from utils import customers             as customers # a generator function in the utils module
from utils import parse_row             as parser
from utils import get_recommendations   as recommender
from utils import print_recommendations as printer

customers |> parser |> recommender |> printer
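
A minimal sketch of what the utils module behind this pipeline might look like (the function bodies are assumptions for illustration) -

def customers():
    # The generator: stream one CSV line per customer.
    for line in open("customers.csv", "r"):
        yield line

def parse_row(line):
    # Split a CSV line into a list of fields.
    return line.strip().split(",")

def get_recommendations(fields):
    # Hypothetical scoring: pair the customer id with suggested products.
    return fields[0] + ": product-a, product-b"

def print_recommendations(recommendation):
    print(recommendation)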

The pipes

Pipes are what connect components together to form a pipeline. As of now, there are 2 types of pipes in the Pipeline language - (1) transformer pipes, and (2) filter pipes. Transformer pipes are used when input should be passed through a component and the component's output passed on. For example, one function can be defined to compute the potential of a particle and another to print that potential.

particles |> get_potential |> printer

The above pipeline code would pass each element of the stream generated by particles through get_potential, and then the output of get_potential through printer. Filter pipes work similarly, except that the following component is used to filter data. For example, a function can be defined to determine whether a person is over 50, so that only the names of those people are printed to a file.

population /> over_50 |> printer

This would use the function referenced by over_50 to filter out data from the stream generated by population and then pass output to printer.
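
In Python, the components behind these two pipelines might look like the following sketch; the attribute names and the potential formula are assumptions for illustration -

def get_potential(particle):
    # Transformer: compute a Coulomb-like potential from hypothetical attributes.
    return particle.charge / particle.distance

def over_50(person):
    # Filter: returning True keeps the element in the stream.
    return person.age > 50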

The where keyword

The where keyword lets you pass multiple parameters to a component, as opposed to just the output of the previous component. The * stands for the element currently passing through the pipeline. For example, a function can be defined to print to a file the names of all applicants under a certain age.

applicants
|> printer where (person=*, age_limit=21)
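
With this where clause, printer receives the current stream element plus the extra age_limit parameter. A minimal sketch of such a component, with an assumed person object -

def printer(person, age_limit):
    # Only print the names of applicants under the limit.
    if person.age < age_limit:
        print(person.name)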

This could be done using a filter as well.

applicants
/> age_limit where (person=*, age=21)
|> printer

In this case, the function for age_limit could look something like this -

def age_limit(person, age):
    return person.age <= age

Note that this function still has just one return value - the boolean expression that determines whether input to the component is passed on as output.

The to keyword

The to keyword is for when the previous component has multiple return values and you want to name them and specify which ones to pass on to the next component. As an example, if you had a function for calculating the electronegativity and electron affinity of an atom, you could use it in a pipeline as follows -

atoms
|> calculator to (electronegativity, electron_affinity)
|> printer where (line=electronegativity)

Here's an example using a filter.

atoms
/> below where (atom=*, limit=2) to (is_below, electronegativity, electron_affinity) with is_below
|> printer where (line=electronegativity)

Note the use of the with keyword here. This is necessary for filters to specify which return value of the function is used to filter out elements in the stream.
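
For completeness, the components in these two examples might be implemented along the following lines; the atom attributes are placeholders for illustration, not real chemistry -

def calculator(atom):
    # Returns two values, which the pipeline binds by position via 'to'.
    return atom.electronegativity, atom.electron_affinity

def below(atom, limit):
    # Returns three values; 'with is_below' picks the flag used for filtering.
    is_below = atom.electronegativity < limit
    return is_below, atom.electronegativity, atom.electron_affinity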

Getting started

All you need to get started is the Pipelines compiler. You can install it by downloading the executable from Releases.

If you have the Nimble package manager installed and ~/.nimble/bin permanently added to your PATH environment variable (look this up if you don't know how to do this), you can also install it by running the following command.

nimble install pipelines

Pipelines' only dependency is a Python interpreter installed on your system. At the moment, most versions up to 2.7 are supported, and support for Python 3 is in the works. Once Pipelines is installed and added to your PATH, you can create, run, or compile a .pipeline file anywhere on your system -

$ pipelines
the .pipeline compiler (v:0.1.0)

usage:
  pipelines                Show this
  pipelines <file>         Compile .pipeline file
  pipelines <folder>       Compile all .pipeline files in folder
  pipelines run <file>     Run .pipeline file
  pipelines clean <folder> Remove all compiled .py files from folder

for more info, go to github.com/calebwin/pipelines
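
For example, assuming the Fizz Buzz pipeline above is saved as fizzbuzz.pipeline, it could be run directly or compiled to a .py file -

$ pipelines run fizzbuzz.pipeline
$ pipelines fizzbuzz.pipeline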

Some next steps

There are several things I'm hoping to implement in the future for this project. I'm hoping to implement some sort of and operator for piping data from the stream into multiple components in parallel, with the output ending up back in the stream in a nondeterministic order. Further down the line, I plan on porting the whole thing to C and putting in a complete error handling system.
