Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs. Reflow then evaluates these programs in a cloud environment, transparently parallelizing work and memoizing results. Reflow was created at GRAIL to manage our NGS (next generation sequencing) bioinformatics workloads on AWS, but has also been used for many other applications, including model training and ad-hoc data analyses.
Reflow comprises:
Reflow thus allows scientists and engineers to write straightforward programs and then have them transparently executed in a cloud environment. Programs are automatically parallelized and distributed across multiple machines, and redundant computations (even across runs and users) are eliminated by its memoization cache. Reflow evaluates its programs incrementally: whenever the input data or program changes, only those outputs that depend on the changed data or code are recomputed.
In addition to the default cluster computing mode, Reflow programs can also be run locally, making use of the local machine's Docker daemon (including Docker for Mac).
Reflow was designed to support sophisticated, large-scale bioinformatics workflows, but should be widely applicable to scientific and engineering computing workloads. It was built using Go.
Reflow joins a long list of systems designed to tackle bioinformatics workloads, but differs from these in important ways:
You can get binaries (macOS/amd64, Linux/amd64) for the latest release at the GitHub release page.
If you are developing Reflow, or would like to build it yourself, please follow the instructions in the section "Developing and building Reflow."
Reflow is distributed with an EC2 cluster manager, and a memoization cache implementation based on S3. These must be configured before use. Reflow maintains a configuration file in $HOME/.reflow/config.yaml by default (this can be overridden with the -config option). Reflow's setup commands modify this file directly. After each step, the current configuration can be examined by running reflow config.
Note that Reflow must have access to AWS credentials and configuration in the environment (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) while running these commands.
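Before running the setup commands, it can help to confirm that the three variables named above are actually set. The following is a minimal sketch of such a check; the helper name missing_aws_vars is hypothetical and is not part of Reflow, and the check only looks at the process environment.

```python
import os

# The three variables named in the note above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_vars(environ=None):
    """Return the required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("missing from environment: " + ", ".join(missing))
    else:
        print("AWS environment looks complete")
```

An empty result only means the variables are present; it does not validate the credentials themselves.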
% reflow setup-ec2
% reflow config
cluster: ec2cluster
ec2cluster:
  ami: ami-d0e54eb0
  diskspace: 100
  disktype: gp2
  instancetypes:
  - c1.medium
  - c1.xlarge
  - c3.2xlarge
  - c3.4xlarge
  - c3.8xlarge
  - c3.large
  - c3.xlarge
  - c4.2xlarge
  - c4.4xlarge
  - c4.8xlarge
  - c4.large
  - c4.xlarge
  - cc2.8xlarge
  - m1.large
  - m1.medium
  - m1.small
  - m1.xlarge
  - m2.2xlarge
  - m2.4xlarge
  - m2.xlarge
  - m3.2xlarge
  - m3.large
  - m3.medium
  - m3.xlarge
  - m4.16xlarge
  - m4.4xlarge
  - m4.xlarge
  - r4.xlarge
  - t1.micro
  - t2.large
  - t2.medium
  - t2.micro
  - t2.nano
  - t2.small
  keyname: ""
  maxinstances: 10
  region: us-west-2
  securitygroup: <a newly created security group here>
  sshkey: <your public SSH key here>
https: httpsca,$HOME/.reflow/reflow.pem
After running reflow setup-ec2, we see that Reflow created a new security group (associated with the account's default VPC), and configured the cluster to use some default settings. Feel free to edit the configuration file ($HOME/.reflow/config.yaml) to your taste. If you want to use spot instances, add a new key under ec2cluster: spot: true.
Reflow only configures one security group per account: Reflow will reuse a previously created security group if reflow setup-ec2 is run anew. See reflow setup-ec2 -help for more details.
Next, we'll set up a cache. This isn't strictly necessary, but we'll need it in order to use many of Reflow's sophisticated caching and incremental computation features. On AWS, Reflow implements a cache based on S3 and DynamoDB. A new S3-based cache is provisioned by reflow setup-s3-repository and reflow setup-dynamodb-assoc, each of which takes one argument naming the S3 bucket and DynamoDB table name to be used, respectively. The S3 bucket is used to store file objects while the DynamoDB table is used to store associations between logically named computations and their concrete output. Note that S3 bucket names are global, so pick a name that's likely to be unique.
% reflow setup-s3-repository reflow-quickstart-cache
reflow: creating s3 bucket reflow-quickstart-cache
reflow: created s3 bucket reflow-quickstart-cache
% reflow setup-dynamodb-assoc reflow-quickstart
reflow: creating DynamoDB table reflow-quickstart
reflow: created DynamoDB table reflow-quickstart
% reflow config
assoc: dynamodb,reflow-quickstart
repository: s3,reflow-quickstart-cache
<rest is same as before>
The setup commands created the S3 bucket and DynamoDB table as needed, and modified the configuration accordingly.
We're now ready to run our first "hello world" program!
Create a file called "hello.rf" with the following contents:
val Main = exec(image := "ubuntu", mem := GiB) (out file) {"
echo hello world >>{{out}}
"}
and run it:
% reflow run hello.rf
reflow: run ID: 6da656d1
ec2cluster: 0 instances: (<=$0.0/hr), total{}, waiting{mem:1.0GiB cpu:1 disk:1.0GiB
reflow: total n=1 time=0s
ident n ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)
hello.Main 1 1 0B
a948904f
Here, Reflow started a new t2.small instance (Reflow matches the workload with available instance types), ran echo hello world inside of an Ubuntu container, placed the output in a file, and returned its SHA256 digest. (Reflow represents file contents using their SHA256 digest.)
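To make the digest representation concrete, here is a minimal sketch that computes the SHA256 digest of the bytes that echo hello world writes and abbreviates it to eight hex characters. The abbreviation rule is an assumption inferred from the output shown in this document (the short run ID 82e63a7a is a prefix of the full ID that reflow info prints); the function names are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Full hex SHA256 digest of a file's contents."""
    return hashlib.sha256(content).hexdigest()


def short_id(full_digest: str) -> str:
    """Abbreviate a digest to its first 8 hex characters (assumed convention)."""
    return full_digest[:8]


if __name__ == "__main__":
    full = digest(b"hello world\n")  # the bytes written by `echo hello world`
    print(full)
    print(short_id(full))  # a948904f, the value printed by the run above
```

Because the digest depends only on content, two files with identical bytes are the same value regardless of where they came from.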
We're now ready to explore Reflow more fully.
Let's explore some of Reflow's features through a simple task: aligning NGS read data from the 1000genomes project. Create a file called "align.rf" with the following. The code is commented inline for clarity.
// In order to align raw NGS data, we first need to construct an index
// against which to perform the alignment. We're going to be using
// the BWA aligner, and so we'll need to retrieve a reference sequence
// and create an index that's usable from BWA.
// g1kv37 is a human reference FASTA sequence. (All
// chromosomes.) Reflow has a static type system, but most type
// annotations can be omitted: they are inferred by Reflow. In this
// case, we're creating a file: a reference to the contents of the
// named URL. We're retrieving data from the public 1000genomes S3
// bucket.
val g1kv37 = file("s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz")
// Here we create an indexed version of the g1kv37 reference. It is
// created using the "bwa index" command with the raw FASTA data as
// input. Here we encounter another way to produce data in reflow:
// the exec. An exec runs a (Bash) script inside of a Docker image,
// placing the output in files or directories (or both: execs can
// return multiple values). In this case, we're returning a
// directory since BWA stores multiple index files alongside the raw
// reference. We also declare that the image to be used is
// "biocontainers/bwa" (the BWA image maintained by the
// biocontainers project).
//
// Inside of an exec template (delimited by {" and "}) we refer to
// (interpolate) values in our environment by placing expressions
// inside of the {{ and }} delimiters. In this case we're referring
// to the file g1kv37 declared above, and our output, named out.
//
// Many types of expressions can be interpolated inside of an exec,
// for example strings, integers, files, and directories. Strings
// and integers are rendered using their normal representation,
// files and directories are materialized to a local path before
// starting execution. Thus, in this case, {{g1kv37}} is replaced at
// runtime by a path on disk with a file with the contents of the
// file g1kv37 (i.e.,
// s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz)
val reference = exec(image := "biocontainers/bwa:v0.7.15_cv3", mem := 6*GiB, cpu := 1) (out dir) {"
# Ignore failures here. The file from 1000genomes has a trailer
# that isn't recognized by gunzip. (This is not recommended practice!)
gunzip -c {{g1kv37}} > {{out}}/g1k_v37.fa || true
cd {{out}}
bwa index -a bwtsw g1k_v37.fa
"}
// Now that we have defined a reference, we can define a function to
// align a pair of reads against the reference, producing an output
// SAM-formatted file. Functions compute expressions over a set of
// abstract parameters, in this case, a pair of read files. Unlike almost
// everywhere else in Reflow, function parameters must be explicitly
// typed.
//
// (Note that we're using a syntactic short-hand here: parameter lists can
// be abbreviated. "r1, r2 file" is equivalent to "r1 file, r2 file".)
//
// The implementation of align is a straightforward invocation of "bwa mem".
// Note that "r1" and "r2" inside of the exec refer to the function arguments,
// thus align can be invoked for any set of r1, r2.
func align(r1, r2 file) =
exec(image := "biocontainers/bwa:v0.7.15_cv3", mem := 20*GiB, cpu := 16) (out file) {"
bwa mem -M -t 16 {{reference}}/g1k_v37.fa {{r1}} {{r2}} > {{out}}
"}
// We're ready to test our workflow now. We pick an arbitrary read
// pair from the 1000genomes data set, and invoke align. There are a
// few things of note here. First is the identifier "Main". This
// names the expression that's evaluated by `reflow run` -- the
// entry point of the computation. Second, we've defined Main to be
// a block. A block is an expression that contains one or more
// definitions followed by an expression. The value of a block is the
// final expression. Finally, Main contains a @requires annotation.
// This instructs Reflow how many resources to reserve for the work
// being done. Note that, because Reflow is able to distribute work,
// if a single instance is too small to execute fully in parallel,
// Reflow will provision additional compute instances to help along.
// @requires thus denotes the smallest possible instance
// configuration that's required for the program.
@requires(cpu := 16, mem := 24*GiB, disk := 50*GiB)
val Main = {
r1 := file("s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_1.filt.fastq.gz")
r2 := file("s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_2.filt.fastq.gz")
align(r1, r2)
}
Now we're ready to run our module. First, let's run reflow doc. This does two things. First, it typechecks the module (and any dependent modules), and second, it prints documentation for the public declarations in the module. Identifiers that begin with an uppercase letter are public (and may be used from other modules); others are not.
% reflow doc align.rf
Declarations
val Main (out file)
We're ready to test our workflow now. We pick an arbitrary read pair from the
1000genomes data set, and invoke align. There are a few things of note here.
First is the identifier "Main". This names the expression that's evaluated by
`reflow run` -- the entry point of the computation. Second, we've defined Main
to be a block. A block is an expression that contains one or more definitions
followed by an expression. The value of a block is the final expression. Finally,
Main contains a @requires annotation. This instructs Reflow how many resources
to reserve for the work being done. Note that, because Reflow is able to
distribute work, if a single instance is too small to execute fully in parallel,
Reflow will provision additional compute instances to help along. @requires thus
denotes the smallest possible instance configuration that's required for the
program.
Then let's run it:
% reflow run align.rf
reflow: run ID: 82e63a7a
ec2cluster: 1 instances: c5.4xlarge:1 (<=$0.7/hr), total{mem:29.8GiB cpu:16 disk:250.0GiB intel_avx512:16}, waiting{}, pending{}
82e63a7a: elapsed: 2m30s, executing:1, completed: 3/5
align.reference: exec ..101f9a082e1679c16d23787c532a0107537c9c # Ignore failures here. The f..bwa index -a bwtsw g1k_v37.fa 2m4s
Reflow launched a new instance: the previously launched instance (a t2.small) was not big enough to fit the requirements of align.rf. Note also that Reflow assigns a run name for each reflow run invocation. This can be used to look up run details with the reflow info command. In this case:
% reflow info 82e63a7a
82e63a7aee201d137f8ade3d584c234b856dc6bdeba00d5d6efc9627bd988a68 (run)
time: Wed Dec 12 10:45:04 2018
program: /Users/you/align.rf
phase: Eval
alloc: ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1
resources: {mem:28.9GiB cpu:16 disk:245.1GiB intel_avx:16 intel_avx2:16 intel_avx512:16}
log: /Users/you/.reflow/runs/82e63a7aee201d137f8ade3d584c234b856dc6bdeba00d5d6efc9627bd988a68.execlog
Here we see that the run is currently being performed on the alloc named ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1. An alloc is a resource reservation on a single machine. A run can make use of multiple allocs to distribute work across multiple machines. The alloc is a URI, and the first component is the real hostname. You can ssh into the host in order to inspect what's going on. Reflow launched the instance with your public SSH key (as long as it was set up by reflow setup-ec2, and $HOME/.ssh/id_rsa.pub existed at that time).
% ssh core@ec2-34-213-42-76.us-west-2.compute.amazonaws.com
...
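Since the first component of an alloc URI is the real hostname, the host to ssh into can be extracted mechanically. The sketch below splits an alloc URI of the form shown above into host, port, and alloc ID. The field names and the parse_alloc_uri helper are hypothetical; the host:port/id layout is inferred from the example output rather than from a formal specification.

```python
from typing import NamedTuple


class AllocURI(NamedTuple):
    host: str
    port: int
    alloc_id: str


def parse_alloc_uri(uri: str) -> AllocURI:
    """Split 'host:port/allocid' into its parts."""
    hostport, _, alloc_id = uri.partition("/")
    host, _, port = hostport.rpartition(":")
    if not host or not port.isdigit() or not alloc_id:
        raise ValueError(f"not an alloc URI: {uri!r}")
    return AllocURI(host, int(port), alloc_id)


if __name__ == "__main__":
    uri = "ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1"
    parsed = parse_alloc_uri(uri)
    print("ssh core@" + parsed.host)
```

If an exec URI is passed instead, everything after the first slash (alloc ID and exec ID) ends up in alloc_id.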
As the run progresses, Reflow prints execution status of each task on the console.
...
align.Main.r2: intern s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_2.filt.fastq.gz 23s
align.Main.r1: intern done 1.8GiB 23s
align.g1kv37: intern done 851.0MiB 23s
align.reference: exec ..101f9a082e1679c16d23787c532a0107537c9c # Ignore failures here. The f..bwa index -a bwtsw g1k_v37.fa 6s
Here, Reflow started downloading r1 and r2 in parallel with creating the reference. Creating the reference is an expensive operation. We can examine it while it's running with reflow ps:
% reflow ps
3674721e align.reference 10:46AM 0:00 running 4.4GiB 1.0 6.5GiB bwa
This tells us that the only task that's currently running is bwa to compute the reference. It's currently using 4.4GiB of memory, 1 core, and 6.5GiB of disk space. By passing the -l option, reflow ps also prints the task's exec URI.
% reflow ps -l
3674721e align.reference 10:46AM 0:00 running 4.4GiB 1.0 6.5GiB bwa ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e
An exec URI is a handle to the actual task being executed. It globally identifies all tasks, and can be examined with reflow info:
% reflow info ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e
ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e (exec)
state: running
type: exec
ident: align.reference
image: index.docker.io/biocontainers/bwa@sha256:0529e39005e35618c4e52f8f56101f9a082e1679c16d23787c532a0107537c9c
cmd: "\n\t# Ignore failures here. The file from 1000genomes has a trailer\n\t# that isn't recognized by gunzip. (This is not recommended practice!)\n\tgunzip -c {{arg[0]}} > {{arg[1]}}/g1k_v37.fa || true\n\tcd {{arg[2]}}\n\tbwa index -a bwtsw g1k_v37.fa\n"
arg[0]:
.: sha256:8b6c538abf0dd92d3f3020f36cc1dd67ce004ffa421c2781205f1eb690bdb442 (851.0MiB)
arg[1]: output 0
arg[2]: output 0
top:
bwa index -a bwtsw g1k_v37.fa
Here, Reflow tells us that the currently running process is "bwa index...", its template command, and the SHA256 digest of its inputs. Programs often print helpful output to standard error while working; this output can be examined with reflow logs:
% reflow logs ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e
gzip: /arg/0/0: decompression OK, trailing garbage ignored
[bwa_index] Pack FASTA... 18.87 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6203609478, availableWord=448508744
[BWTIncConstructFromPacked] 10 iterations done. 99999990 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999990 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999990 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999990 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999990 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999990 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999990 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999990 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 999999990 characters processed.
[BWTIncConstructFromPacked] 110 iterations done. 1099999990 characters processed.
[BWTIncConstructFromPacked] 120 iterations done. 1199999990 characters processed.
[BWTIncConstructFromPacked] 130 iterations done. 1299999990 characters processed.
[BWTIncConstructFromPacked] 140 iterations done. 1399999990 characters processed.
[BWTIncConstructFromPacked] 150 iterations done. 1499999990 characters processed.
[BWTIncConstructFromPacked] 160 iterations done. 1599999990 characters processed.
[BWTIncConstructFromPacked] 170 iterations done. 1699999990 characters processed.
[BWTIncConstructFromPacked] 180 iterations done. 1799999990 characters processed.
[BWTIncConstructFromPacked] 190 iterations done. 1899999990 characters processed.
[BWTIncConstructFromPacked] 200 iterations done. 1999999990 characters processed.
[BWTIncConstructFromPacked] 210 iterations done. 2099999990 characters processed.
[BWTIncConstructFromPacked] 220 iterations done. 2199999990 characters processed.
[BWTIncConstructFromPacked] 230 iterations done. 2299999990 characters processed.
[BWTIncConstructFromPacked] 240 iterations done. 2399999990 characters processed.
[BWTIncConstructFromPacked] 250 iterations done. 2499999990 characters processed.
[BWTIncConstructFromPacked] 260 iterations done. 2599999990 characters processed.
[BWTIncConstructFromPacked] 270 iterations done. 2699999990 characters processed.
[BWTIncConstructFromPacked] 280 iterations done. 2799999990 characters processed.
[BWTIncConstructFromPacked] 290 iterations done. 2899999990 characters processed.
[BWTIncConstructFromPacked] 300 iterations done. 2999999990 characters processed.
[BWTIncConstructFromPacked] 310 iterations done. 3099999990 characters processed.
[BWTIncConstructFromPacked] 320 iterations done. 3199999990 characters processed.
[BWTIncConstructFromPacked] 330 iterations done. 3299999990 characters processed.
[BWTIncConstructFromPacked] 340 iterations done. 3399999990 characters processed.
[BWTIncConstructFromPacked] 350 iterations done. 3499999990 characters processed.
[BWTIncConstructFromPacked] 360 iterations done. 3599999990 characters processed.
[BWTIncConstructFromPacked] 370 iterations done. 3699999990 characters processed.
[BWTIncConstructFromPacked] 380 iterations done. 3799999990 characters processed.
[BWTIncConstructFromPacked] 390 iterations done. 3899999990 characters processed.
[BWTIncConstructFromPacked] 400 iterations done. 3999999990 characters processed.
[BWTIncConstructFromPacked] 410 iterations done. 4099999990 characters processed.
[BWTIncConstructFromPacked] 420 iterations done. 4199999990 characters processed.
[BWTIncConstructFromPacked] 430 iterations done. 4299999990 characters processed.
[BWTIncConstructFromPacked] 440 iterations done. 4399999990 characters processed.
[BWTIncConstructFromPacked] 450 iterations done. 4499999990 characters processed.
At this point, it looks like everything is running as expected. There's not much more to do than wait. Note that, while creating an index takes a long time, Reflow only has to compute it once. When it's done, Reflow memoizes the result, uploading the resulting data directly to the configured S3 cache bucket. The next time the reference expression is encountered, Reflow will use the previously computed result. If the input file changes (e.g., we decide to use another reference sequence), Reflow will recompute the index again. The same will happen if the command (or Docker image) that's used to compute the index changes. Reflow keeps track of all the dependencies for a particular sub computation, and recomputes them only when dependencies have changed. This way, we always know what is being computed is correct (the result is the same as if we had computed the result from scratch), but avoid paying the cost of redundant computation.
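The idea of recomputing only when the input, the command, or the image changes can be illustrated with a small in-memory sketch. This is not Reflow's implementation: the cache_key function, the MemoCache class, and the use of a dict in place of the S3 and DynamoDB cache are all assumptions made for illustration.

```python
import hashlib
import json


def cache_key(image, cmd, input_digests):
    """Hash everything the result depends on into a single key."""
    payload = json.dumps(
        {"image": image, "cmd": cmd, "inputs": list(input_digests)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class MemoCache:
    """Toy memoization cache; a dict stands in for the remote cache."""

    def __init__(self):
        self._results = {}
        self.computed = 0  # how many times real work was done

    def run(self, image, cmd, input_digests, compute):
        key = cache_key(image, cmd, input_digests)
        if key not in self._results:
            self._results[key] = compute()
            self.computed += 1
        return self._results[key]


if __name__ == "__main__":
    cache = MemoCache()
    args = ("biocontainers/bwa:v0.7.15_cv3", "bwa index ...", ["sha256:8b6c538a"])
    cache.run(*args, compute=lambda: "index-v1")
    cache.run(*args, compute=lambda: "index-v1")  # served from the cache
    print(cache.computed)  # 1
```

Changing any of the three key components produces a different key, so the computation runs again; leaving them unchanged reuses the stored result.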
After a little while, the reference will have finished generating, and Reflow begins alignment. Here, Reflow reports that the reference took 52 minutes to compute, and produced 8 GiB of output.
align.reference: exec done 8.0GiB 52m37s
align.align: exec ..101f9a082e1679c16d23787c532a0107537c9c bwa mem -M -t 16 {{reference}..37.fa {{r1}} {{r2}} > {{out}} 4s
If we query ("info") the reference exec again, Reflow reports precisely what was produced:
% reflow info ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e
ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e (exec)
state: complete
type: exec
ident: align.reference
image: index.docker.io/biocontainers/bwa@sha256:0529e39005e35618c4e52f8f56101f9a082e1679c16d23787c532a0107537c9c
cmd: "\n\t# Ignore failures here. The file from 1000genomes has a trailer\n\t# that isn't recognized by gunzip. (This is not recommended practice!)\n\tgunzip -c {{arg[0]}} > {{arg[1]}}/g1k_v37.fa || true\n\tcd {{arg[2]}}\n\tbwa index -a bwtsw g1k_v37.fa\n"
arg[0]:
.: sha256:8b6c538abf0dd92d3f3020f36cc1dd67ce004ffa421c2781205f1eb690bdb442 (851.0MiB)
arg[1]: output 0
arg[2]: output 0
result:
list[0]:
g1k_v37.fa: sha256:2f9cd9e853a9284c53884e6a551b1c7284795dd053f255d630aeeb114d1fa81f (2.9GiB)
g1k_v37.fa.amb: sha256:dd51a07041a470925c1ebba45c2f534af91d829f104ade8fc321095f65e7e206 (6.4KiB)
g1k_v37.fa.ann: sha256:68928e712ef48af64c5b6a443f2d2b8517e392ae58b6a4ab7191ef7da3f7930e (6.7KiB)
g1k_v37.fa.bwt: sha256:2aec938930b8a2681eb0dfbe4f865360b98b2b6212c1fb9f7991bc74f72d79d8 (2.9GiB)
g1k_v37.fa.pac: sha256:d62039666da85d859a29ea24af55b3c8ffc61ddf02287af4d51b0647f863b94c (739.5MiB)
g1k_v37.fa.sa: sha256:99eb6ff6b54fba663c25e2642bb2a6c82921c931338a7144327c1e3ee99a4447 (1.4GiB)
In this case, "bwa index" produced a number of auxiliary index files. These are the contents of the "reference" directory.
We can again query Reflow for running execs, and examine the alignment. We see now that the reference is passed in (argument 0), alongside the read pairs (arguments 1 and 2).
% reflow ps -l
6a6c36f5 align.align 5:12PM 0:00 running 5.9GiB 12.3 0B bwa ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/6a6c36f5da6ee387510b0b61d788d7e4c94244d61e6bc621b43f59a73443a755
% reflow info ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/6a6c36f5da6ee387510b0b61d788d7e4c94244d61e6bc621b43f59a73443a755
ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/6a6c36f5da6ee387510b0b61d788d7e4c94244d61e6bc621b43f59a73443a755 (exec)
state: running
type: exec
ident: align.align
image: index.docker.io/biocontainers/bwa@sha256:0529e39005e35618c4e52f8f56101f9a082e1679c16d23787c532a0107537c9c
cmd: "\n\t\tbwa mem -M -t 16 {{arg[0]}}/g1k_v37.fa {{arg[1]}} {{arg[2]}} > {{arg[3]}}\n\t"
arg[0]:
g1k_v37.fa: sha256:2f9cd9e853a9284c53884e6a551b1c7284795dd053f255d630aeeb114d1fa81f (2.9GiB)
g1k_v37.fa.amb: sha256:dd51a07041a470925c1ebba45c2f534af91d829f104ade8fc321095f65e7e206 (6.4KiB)
g1k_v37.fa.ann: sha256:68928e712ef48af64c5b6a443f2d2b8517e392ae58b6a4ab7191ef7da3f7930e (6.7KiB)
g1k_v37.fa.bwt: sha256:2aec938930b8a2681eb0dfbe4f865360b98b2b6212c1fb9f7991bc74f72d79d8 (2.9GiB)
g1k_v37.fa.pac: sha256:d62039666da85d859a29ea24af55b3c8ffc61ddf02287af4d51b0647f863b94c (739.5MiB)
g1k_v37.fa.sa: sha256:99eb6ff6b54fba663c25e2642bb2a6c82921c931338a7144327c1e3ee99a4447 (1.4GiB)
arg[1]:
.: sha256:0c1f85aa9470b24d46d9fc67ba074ca9695d53a0dee580ec8de8ed46ef347a85 (1.8GiB)
arg[2]:
.: sha256:47f5e749123d8dda92b82d5df8e32de85273989516f8e575d9838adca271f630 (1.7GiB)
arg[3]: output 0
top:
/bin/bash -e -l -o pipefail -c ..bwa mem -M -t 16 /arg/0/0/g1k_v37.fa /arg/1/0 /arg/2/0 > /return/0 .
bwa mem -M -t 16 /arg/0/0/g1k_v37.fa /arg/1/0 /arg/2/0
Note that the read pairs are files. Files in Reflow do not have names; they are just blobs of data. When Reflow runs a process that requires input files, those anonymous files are materialized on disk, but the filenames are not meaningful. In this case, we can see from the "top" output (these are the actual running processes, as reported by the OS), that r1 ended up being called "/arg/1/0" and r2 "/arg/2/0". The output is a file named "/return/0".
Finally, alignment is complete. Aligning a single read pair took around 19m, and produced 13.2 GiB of output. Upon completion, Reflow prints runtime statistics and the result.
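The relationship between the template command reported by reflow info and the concrete command seen in "top" can be sketched as a simple substitution. This is only an illustration of the mapping observed in the output above, not Reflow's actual code: the render helper is hypothetical, and the rule that output n is materialized at /return/n is an assumption generalized from the single /return/0 seen here.

```python
import re


def render(cmd, outputs):
    """Replace {{arg[i]}} with a concrete path.

    outputs maps an argument index to its output number; every other
    argument is treated as an input materialized at /arg/<i>/0.
    """
    def path(match):
        index = int(match.group(1))
        if index in outputs:
            return "/return/{}".format(outputs[index])
        return "/arg/{}/0".format(index)

    return re.sub(r"\{\{arg\[(\d+)\]\}\}", path, cmd)


if __name__ == "__main__":
    template = "bwa mem -M -t 16 {{arg[0]}}/g1k_v37.fa {{arg[1]}} {{arg[2]}} > {{arg[3]}}"
    print(render(template, {3: 0}))
```

For the align exec above, this reproduces the command line that "top" reports.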
reflow: total n=5 time=1h9m57s
ident n ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)
align.align 1 0 0B 17/17/17 15.6/15.6/15.6 7.8/7.8/7.8 12.9/12.9/12.9 0.0/0.0/0.0
align.Main.r2 1 0 0B
align.Main.r1 1 0 0B
align.reference 1 0 0B 51/51/51 1.0/1.0/1.0 4.4/4.4/4.4 6.5/6.5/6.5 0.0/0.0/0.0
align.g1kv37 1 0 0B
becb0485
Reflow represents file values by the SHA256 digest of the file's content. In this case, that's not very useful: you want the file, not its digest. Reflow provides mechanisms to export data. In this case let's copy the resulting file to an S3 bucket.
We'll make use of the "files" system module to copy the aligned file to an external S3 bucket. Modify align.rf's Main to the following (but pick an S3 bucket you own), and then run it again. Commentary is inline for clarity.
@requires(cpu := 16, mem := 24*GiB, disk := 50*GiB)
val Main = {
r1 := file("s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_1.filt.fastq.gz")
r2 := file("s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_2.filt.fastq.gz")
// Instantiate the system modules "files" (system modules begin
// with $), assigning its instance to the "files" identifier. To
// view the documentation for this module, run `reflow doc
// $/files`.
files := make("$/files")
// As before.
aligned := align(r1, r2)
// Use the files module's Copy function to copy the aligned file to
// the provided destination.
files.Copy(aligned, "s3://marius-test-bucket/aligned.sam")
}
And run it again:
% reflow run align.rf
reflow: run ID: 9f0f3596
reflow: total n=2 time=1m9s
ident n ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)
align_2.align 1 1 0B
align_2.Main 1 0 13.2GiB
val<.=becb0485 13.2GiB>
Here we see that Reflow did not need to recompute the aligned file; it is instead retrieved from cache. The reference index generation is skipped altogether. Status lines that indicate "xfer" (instead of "run") mean that Reflow is performing a cache transfer in place of running the computation. Reflow claims to have transferred a 13.2 GiB file to s3://marius-test-bucket/aligned.sam. Indeed it did:
% aws s3 ls s3://marius-test-bucket/aligned.sam
2018-12-13 16:29:49 14196491221 aligned.sam.
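The byte count listed by aws s3 ls can be checked against the 13.2 GiB that Reflow reported. A minimal sketch of the conversion, using binary units (1 GiB = 2^30 bytes); the to_gib helper is just for illustration:

```python
GIB = 2 ** 30  # bytes per GiB


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


if __name__ == "__main__":
    print("{:.1f} GiB".format(to_gib(14196491221)))  # 13.2 GiB
```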
This code was modularized and generalized in 1000align. Here, fastq, bam, and alignment utilities are split into their own parameterized modules. The toplevel module, 1000align, is instantiated from the command line. Command line invocations (reflow run) can pass module parameters through flags (strings, booleans, and integers):
% reflow run 1000align.rf -help
usage of 1000align.rf:
-out string
out is the target of the output merged BAM file (required)
-sample string
sample is the name of the 1000genomes phase 3 sample (required)
For example, to align the full sample from above, we can invoke 1000align.rf with the following arguments:
% reflow run 1000align.rf -sample HG00103 -out s3://marius-test-bucket/HG00103.bam
In this case, if your account limits allow it, Reflow will launch additional EC2 instances in order to further parallelize the work to be done, since we're aligning multiple pairs of FASTQ files. In this run, we can see that Reflow is aligning 5 pairs in parallel across 2 instances (four can fit on the initial m4.16xlarge instance).
% reflow ps -l
e74d4311 align.align.sam 11:45AM 0:00 running 10.9GiB 31.8 6.9GiB bwa ec2-34-210-201-193.us-west-2.compute.amazonaws.com:9000/6a7ffa00d6b0d9e1/e74d4311708f1c9c8d3894a06b59029219e8a545c69aa79c3ecfedc1eeb898f6
59c561be align.align.sam 11:45AM 0:00 running 10.9GiB 32.7 6.4GiB bwa ec2-34-210-201-193.us-west-2.compute.amazonaws.com:9000/6a7ffa00d6b0d9e1/59c561be5f627143108ce592d640126b88c23ba3d00974ad0a3c801a32b50fbe
ba688daf align.align.sam 11:47AM 0:00 running 8.7GiB 22.6 2.9GiB bwa ec2-18-236-233-4.us-west-2.compute.amazonaws.com:9000/ae348d6c8a33f1c9/ba688daf5d50db514ee67972ec5f0a684f8a76faedeb9a25ce3d412e3c94c75c
0caece7f align.align.sam 11:47AM 0:00 running 8.7GiB 25.9 3.4GiB bwa ec2-18-236-233-4.us-west-2.compute.amazonaws.com:9000/ae348d6c8a33f1c9/0caece7f38dc3d451d2a7411b1fcb375afa6c86a7b0b27ba7dd1f9d43d94f2f9
0b59e00c align.align.sam 11:47AM 0:00 running 10.4GiB 22.9 926.6MiB bwa ec2-18-236-233-4.us-west-2.compute.amazonaws.com:9000/ae348d6c8a33f1c9/0b59e00c848fa91e3b0871c30da3ed7e70fbc363bdc48fb09c3dfd61684c5fd9
When it completes, an approximately 17GiB BAM file is deposited to S3:
% aws s3 ls s3://marius-test-bucket/HG00103.bam
2018-12-14 15:27:33 18761607096 HG00103.bam.
Reflow comes with a built-in cluster manager, which is responsible for elastically increasing or decreasing required compute resources. The AWS EC2 cluster manager keeps track of instance type availability and account limits, and uses these to launch the most appropriate set of instances for a given job. Instances terminate themselves after being idle for more than 10 minutes; idle instances are reused when possible.
The cluster manager may be configured under the "ec2cluster" key in Reflow's configuration. Its parameters are documented by godoc. (Formal documentation is forthcoming.)
Reflow is implemented in Go, and its packages are go-gettable. Reflow is also a Go module and uses modules to fix its dependency graph.
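As a rough illustration of matching a job to an instance type, the sketch below picks the cheapest type from a small catalogue that satisfies a CPU and memory requirement. It is a simplification under stated assumptions, not the cluster manager's actual policy: the catalogue entries, their specs, and especially the prices are illustrative placeholders, and availability and account limits are ignored.

```python
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    cpu: int
    mem_gib: float
    price_per_hour: float  # illustrative placeholder, not real pricing


# Hypothetical catalogue; values are for illustration only.
CATALOGUE = [
    InstanceType("t2.small", 1, 2.0, 0.02),
    InstanceType("m4.xlarge", 4, 16.0, 0.20),
    InstanceType("c5.4xlarge", 16, 32.0, 0.70),
    InstanceType("m4.16xlarge", 64, 256.0, 3.20),
]


def cheapest_fit(cpu, mem_gib, catalogue=CATALOGUE) -> Optional[InstanceType]:
    """Return the cheapest type that satisfies the requirement, or None."""
    candidates = [t for t in catalogue if t.cpu >= cpu and t.mem_gib >= mem_gib]
    return min(candidates, key=lambda t: t.price_per_hour, default=None)


if __name__ == "__main__":
    print(cheapest_fit(1, 1.0).name)    # requirement of hello.rf
    print(cheapest_fit(16, 24.0).name)  # @requires of align.rf
```

With this catalogue, the two requirements from the earlier examples select t2.small and c5.4xlarge respectively, which is consistent with the instances seen in the run output.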
After checking out the repository, the usual go commands should work, e.g.:
% go test ./...
The package github.com/grailbio/reflow/cmd/reflow (or subdirectory cmd/reflow in the repository) defines the main command for Reflow. Because Reflow relies on being able to distribute its current build, the binary must be built using the buildreflow tool instead of the ordinary Go tooling. Command buildreflow acts like go build, but also cross compiles the binary for the remote target (Linux/amd64 currently supported), and embeds the cross-compiled binary.
% cd $CHECKOUT/cmd/reflow
% go install github.com/grailbio/reflow/cmd/buildreflow
% buildreflow
The $HOME/.reflow/runs directory contains logs, traces and other information for each Reflow run. If the run you're looking for is no longer there, the info and cat tools can be used if you have the run ID:
% reflow info 2fd5a9b6
runid user start end ExecLog SysLog EvalGraph Trace
2fd5a9b6 username 4:41PM 4:41PM 29a4b506 41a8594d 90f40bfc 4ec75aac
% reflow cat 29a4b506 > /tmp/29a4b506.execlog
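When the run's files are still present locally, they can be located by the short run ID. The sketch below lists files in the runs directory whose names start with a given run ID prefix. It assumes files are named after the full run ID, as in the execlog path printed by reflow info earlier; the find_run_files helper is hypothetical.

```python
from pathlib import Path


def find_run_files(run_id_prefix, runs_dir=None):
    """List files in the runs directory whose names start with the prefix."""
    if runs_dir is None:
        runs_dir = Path.home() / ".reflow" / "runs"
    runs_dir = Path(runs_dir)
    if not runs_dir.is_dir():
        return []
    return sorted(runs_dir.glob(run_id_prefix + "*"))


if __name__ == "__main__":
    for path in find_run_files("82e63a7a"):
        print(path)
```

An empty result suggests the local files are gone, in which case the info and cat commands above are the way to retrieve them.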
For more information about tracing, see: doc/tracing.md.
Please join us on Gitter or on the mailing list to discuss Reflow.
重绘 元素外观改变,如 颜色、背景色,尺寸,定位不会改变,不会影响其他元素 重排 重新计算元素的尺寸和定位,可能会影响到其他元素的位置; 重排一定会引起重绘。 如何减少重排 集中修改样式,或直接切换 css class; 缩小修改范围:尽量修改元素本身而不是他的父元素; 修改样式之前先 display:none;使元素脱离正常文档流; 使用 BFC 特性,不影响其他元素; 频繁触发优化: 节流、防
以下是一些触发浏览器(reflow)的操作 1.字体大小 改变(font size change) 2.窗口大小 改变(screen resize) 3.样式表添加或删除(add/remove stylesheets) 4.JS更改DOM元素(js changing dom) 5.:hover动作(:hover) 6.位置计算(offset calcs) 7.用户输入(user input) 8.
什么是重绘(Repaint)?什么是回流(重排)(Reflow)? 回流: 触发条件:当我们对 DOM 结构的修改引发了 DOM 几何尺寸发生变化的时候,就会发生回流的过程。 例如一下几个操作: 一个 DOM 元素的几何变化,常见的几何属性 width、height、padding、margin、left、top、border 等等 使 DOM 节点发生 增减 或 移动。 读写 offset 族,
浏览器渲染过程 解析HTML生成DOM树。 解析CSS生成CSSOM规则树。 将DOM树与CSSOM规则树合并在一起生成渲染树。 遍历渲染树开始布局,计算每个节点的位置大小信息。 将渲染树每个节点绘制到屏幕。 当浏览器遇到一个 script 标记时,DOM 构建将暂停,直至脚本完成执行,然后继续构建DOM reflow(回流) 为了重新渲染部分或全部的文档而重新计算文档中元素的位置和几何结构的过程
写在前面 在讨论今天的主角之前,我们要先了解一下浏览器的渲染机制。以Google,Firefox,Safari为例,Firefox 使用Geoko——Mozilla 自主研发的渲染引擎,Safari 和Chrome 都使用 webkit。 我们主要以 Webkit的主流程为例 浏览器使用流式布局模型 (Flow Based Layout) 解析HTML 生成 DOM 树 解析CSS 生成CSSOM
H:Highways 总时间限制: 1000ms 内存限制: 65536kB描述The island nation of Flatopia is perfectly flat. Unfortunately, Flatopi ... [转]VS2010中如何创建一个WCF 本文转自:http://www.cnblogs.com/zhangliangzlee/arc
在CSS规范中有一个渲染对象的概念,通常用一个盒子(box, rectangle)来表示。mozilla通过一个叫frame的对象对盒子进行操作。frame主要的动作有三个: 构造frame, 以建立对象树(DOM树) reflow, 以确定对象位置,或者是调用mozilla的Layout(这里是指源码的实现) 绘制,以便对象能显示在屏幕上 总的来说,reflow就是载入内容树(在HTML中就是D
reflow 和 repaint repaint 就是重绘,reflow 就是回流。 严重性:在性能优先的前提下,性能消耗 reflow 大于 repaint。 体现:repaint 是某个 DOM 元素进行重绘;reflow 是整个页面进行重排,也就是页面所有 DOM 元素渲染。 如何触发:style 变动造成 repaint 和 reflow。 不涉及任何 DOM 元素的排版问题的变动为 re
本文向大家介绍会引起Reflow和Repaint的操作有哪些?相关面试题,主要包含被问及会引起Reflow和Repaint的操作有哪些?时的应答技巧和注意事项,需要的朋友参考一下 https://github.com/encountermm/learning-notes/blob/master/%E5%89%8D%E7%AB%AF%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%
本文向大家介绍如何减少浏览器的reflow和repaint,包括了如何减少浏览器的reflow和repaint的使用技巧和注意事项,需要的朋友参考一下 1.避免在document上直接进行频繁的DOM操作,如果确实需要可以采用off-document的方式进行,具体的方法包括但不完全包括以下几种: (1). 先将元素从document中删除,完成修改后再把元素放回原来的位置 (2). 将元素的di