This gem provides a generic lazy batching mechanism to avoid N+1 DB queries, N+1 HTTP requests, and similar problems.
Developers from many companies use BatchLoader.

Let's have a look at the code with N+1 queries:
def load_posts(ids)
  Post.where(id: ids)
end
posts = load_posts([1, 2, 3]) # Posts SELECT * FROM posts WHERE id IN (1, 2, 3)
# _ ↓ _
# ↙ ↓ ↘
users = posts.map do |post| # U ↓ ↓ SELECT * FROM users WHERE id = 1
post.user # ↓ U ↓ SELECT * FROM users WHERE id = 2
end # ↓ ↓ U SELECT * FROM users WHERE id = 3
# ↘ ↓ ↙
# ¯ ↓ ¯
puts users # Users
The naive approach would be to preload dependent objects on the top level:
# With ORM in basic cases
def load_posts(ids)
  Post.where(id: ids).includes(:user)
end
# But without ORM or in more complicated cases you will have to do something like:
def load_posts(ids)
  # load posts
  posts = Post.where(id: ids)
  user_ids = posts.map(&:user_id)

  # load users
  users = User.where(id: user_ids)
  user_by_id = users.each_with_object({}) { |user, memo| memo[user.id] = user }

  # map user to post
  posts.each { |post| post.user = user_by_id[post.user_id] }
end
posts = load_posts([1, 2, 3]) # Posts SELECT * FROM posts WHERE id IN (1, 2, 3)
# _ ↓ _ SELECT * FROM users WHERE id IN (1, 2, 3)
# ↙ ↓ ↘
users = posts.map do |post| # U ↓ ↓
post.user # ↓ U ↓
end # ↓ ↓ U
# ↘ ↓ ↙
# ¯ ↓ ¯
puts users # Users
But the problem here is that load_posts now depends on the child association and knows that it has to preload data for future use. And it'll do it every time, even if it's not necessary. Can we do better? Sure!
With BatchLoader we can rewrite the code above:
def load_posts(ids)
  Post.where(id: ids)
end

def load_user(post)
  BatchLoader.for(post.user_id).batch do |user_ids, loader|
    User.where(id: user_ids).each { |user| loader.call(user.id, user) }
  end
end
posts = load_posts([1, 2, 3]) # Posts SELECT * FROM posts WHERE id IN (1, 2, 3)
# _ ↓ _
# ↙ ↓ ↘
users = posts.map do |post| # BL ↓ ↓
load_user(post) # ↓ BL ↓
end # ↓ ↓ BL
# ↘ ↓ ↙
# ¯ ↓ ¯
puts users # Users SELECT * FROM users WHERE id IN (1, 2, 3)
As we can see, batching is isolated and described right in the place where it's needed.

In general, BatchLoader returns a lazy object. Each lazy object knows which data it needs to load and how to batch the query. As soon as you use any of the lazy objects, all of them will be automatically loaded at once, without N+1 queries.
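For example, a minimal sketch of this laziness, reusing load_posts and load_user from above:

posts = load_posts([1, 2, 3])                # SELECT * FROM posts WHERE id IN (1, 2, 3)
users = posts.map { |post| load_user(post) } # lazy BatchLoaders, no SQL executed yet
users.first.name                             # first method call loads all collected ids at once:
                                             #   SELECT * FROM users WHERE id IN (1, 2, 3)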
So, when we call BatchLoader.for we pass an item (user_id) which should be collected and used for batching later. For the batch method, we pass a block which will use all the collected items (user_ids):
BatchLoader.for(post.user_id).batch do |user_ids, loader| ... end
Inside the block we execute a batch query for our items (User.where). After that, all we have to do is to call the loader, passing the item which was used in the BatchLoader.for method (user_id) and the loaded object itself (user):
BatchLoader.for(post.user_id).batch do |user_ids, loader|
  User.where(id: user_ids).each { |user| loader.call(user.id, user) }
end
When we call any method on the lazy object, it'll be automatically loaded through batching for all instantiated BatchLoaders:
puts users # => SELECT * FROM users WHERE id IN (1, 2, 3)
For more information, see the Implementation details section.
Now imagine we have a regular Rails app with N+1 HTTP requests:
# app/models/post.rb
class Post < ApplicationRecord
  def rating
    HttpClient.request(:get, "https://example.com/ratings/#{id}")
  end
end
# app/controllers/posts_controller.rb
class PostsController < ApplicationController
  def index
    posts = Post.limit(10)
    serialized_posts = posts.map { |post| {id: post.id, rating: post.rating} } # N+1 HTTP requests for each post.rating
    render json: serialized_posts
  end
end
As we can see, the code above will make N+1 HTTP requests, one for each post. Let's batch the requests with a gem called parallel:
class Post < ApplicationRecord
  def rating_lazy
    BatchLoader.for(self).batch do |posts, loader|
      Parallel.each(posts, in_threads: 10) { |post| loader.call(post, post.rating) }
    end
  end

  # ...
end
The loader is thread-safe. So, if HttpClient is also thread-safe, then with the parallel gem we can execute all the HTTP requests concurrently in threads (there are some benchmarks for concurrent HTTP requests in Ruby). Thanks to Matz, MRI releases the GIL when a thread hits blocking I/O, such as an HTTP request in our case.
In the controller, all we have to do is to replace post.rating with the lazy post.rating_lazy:
class PostsController < ApplicationController
  def index
    posts = Post.limit(10)
    serialized_posts = posts.map { |post| {id: post.id, rating: post.rating_lazy} }
    render json: serialized_posts
  end
end
BatchLoader caches the loaded values. To ensure that the cache is purged between requests in the app, add the following middleware to your config/application.rb:
config.middleware.use BatchLoader::Middleware
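For context, a minimal sketch of where that line goes (the MyApp module name is a placeholder):

# config/application.rb
module MyApp # hypothetical application name
  class Application < Rails::Application
    # purge the BatchLoader cache between HTTP requests
    config.middleware.use BatchLoader::Middleware
  end
end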
See the Caching section for more information.
Batching is particularly useful with GraphQL. Using such techniques as preloading data in advance to avoid N+1 queries can be very complicated, since a user can ask for any available fields in a query.
Let's take a look at the simple graphql-ruby schema example:
class MyProjectSchema < GraphQL::Schema
  query Types::QueryType
end

module Types
  class QueryType < Types::BaseObject
    field :posts, [PostType], null: false

    def posts
      Post.all
    end
  end
end
module Types
  class PostType < Types::BaseObject
    graphql_name "Post"
    field :user, UserType, null: false

    def user
      object.user # N+1 queries
    end
  end
end
module Types
  class UserType < Types::BaseObject
    graphql_name "User"
    field :name, String, null: false
  end
end
If we want to execute a simple query like the following, we will get N+1 queries for each post.user:
query = "
{
posts {
user {
name
}
}
}
"
MyProjectSchema.execute(query)
To avoid this problem, all we have to do is to change the resolver to return BatchLoader::GraphQL (#32 explains why not just BatchLoader):
module Types
  class PostType < Types::BaseObject
    graphql_name "Post"
    field :user, UserType, null: false

    def user
      BatchLoader::GraphQL.for(object.user_id).batch do |user_ids, loader|
        User.where(id: user_ids).each { |user| loader.call(user.id, user) }
      end
    end
  end
end
And set up GraphQL to use the built-in lazy_resolve method:
class MyProjectSchema < GraphQL::Schema
  query Types::QueryType
  use BatchLoader::GraphQL
end
That's it.
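Running the query from the example above now produces one batched query for the users, roughly:

MyProjectSchema.execute(query) # SELECT * FROM posts
                               # SELECT * FROM users WHERE id IN (1, 2, 3)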
For batches where there is no item in response to a call, we normally return nil. However, you can use :default_value to return something else instead:
BatchLoader.for(post.user_id).batch(default_value: NullUser.new) do |user_ids, loader|
  User.where(id: user_ids).each { |user| loader.call(user.id, user) }
end
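A usage sketch, assuming a hypothetical user_lazy helper that wraps the block above and an id (404) that matches no row:

user = user_lazy(404)
puts user # SELECT * FROM users WHERE id IN (404)
# => #<NullUser:...> (loader was never called for 404, so the default_value is returned)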
For batches where the value is some kind of collection, such as an Array or Hash, loader also supports being called with a block, which yields the current value and returns the next value. This is extremely useful for one-to-many (has_many) relationships:
BatchLoader.for(user.id).batch(default_value: []) do |user_ids, loader|
  Comment.where(user_id: user_ids).each do |comment|
    loader.call(comment.user_id) { |memo| memo << comment }
  end
end
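A usage sketch (the users variable is assumed to hold User records): each loader.call(key) { |memo| ... } yields the value collected so far for that key, starting from the default_value, and stores the block's return value:

lazy_comments = users.map do |user|
  BatchLoader.for(user.id).batch(default_value: []) do |user_ids, loader|
    Comment.where(user_id: user_ids).each do |comment|
      # yield the array accumulated for this user_id and append one more comment
      loader.call(comment.user_id) { |memo| memo << comment }
    end
  end
end

puts lazy_comments # SELECT * FROM comments WHERE user_id IN (1, 2, 3)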
It's possible to reuse the same BatchLoader#batch block for loading different types of data by specifying a unique key. For example, with polymorphic associations:
def lazy_association(post)
  id = post.association_id
  key = post.association_type

  BatchLoader.for(id).batch(key: key) do |ids, loader, args|
    model = Object.const_get(args[:key])
    model.where(id: ids).each { |record| loader.call(record.id, record) }
  end
end

post1 = Post.create(association_id: 1, association_type: 'Tag')
post2 = Post.create(association_id: 1, association_type: 'Category')

lazy_association(post1) # SELECT * FROM tags WHERE id IN (1)
lazy_association(post2) # SELECT * FROM categories WHERE id IN (1)
It's also required to pass a custom key when using BatchLoader with metaprogramming (e.g. eval).
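For illustration, a hedged sketch of the metaprogramming case (the Tag and Category models, the LazyFinders class, and the generated method names are hypothetical). Dynamically generated batch blocks can share the same source location, so without an explicit key BatchLoader could not tell the batches apart:

class LazyFinders
  [Tag, Category].each do |model|
    define_method("lazy_#{model.name.downcase}") do |id|
      # both generated blocks share one source location; key keeps them separate
      BatchLoader.for(id).batch(key: model.name) do |ids, loader|
        model.where(id: ids).each { |record| loader.call(record.id, record) }
      end
    end
  end
end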
By default, BatchLoader caches the loaded values. You can test it by running something like:
def user_lazy(id)
  BatchLoader.for(id).batch do |ids, loader|
    User.where(id: ids).each { |user| loader.call(user.id, user) }
  end
end
puts user_lazy(1) # SELECT * FROM users WHERE id IN (1)
# => <#User:...>
puts user_lazy(1) # no request
# => <#User:...>
Usually, it's just enough to clear the cache between HTTP requests in the app. To do so, simply add the middleware:
use BatchLoader::Middleware
To drop the cache manually you can run:
puts user_lazy(1) # SELECT * FROM users WHERE id IN (1)
puts user_lazy(1) # no request
BatchLoader::Executor.clear_current
puts user_lazy(1) # SELECT * FROM users WHERE id IN (1)
In some rare cases it's useful to disable caching for BatchLoader. For example, in tests or after data mutations:
def user_lazy(id)
  BatchLoader.for(id).batch(cache: false) do |ids, loader|
    # ...
  end
end
puts user_lazy(1) # SELECT * FROM users WHERE id IN (1)
puts user_lazy(1) # SELECT * FROM users WHERE id IN (1)
If you set cache: false, it's likely you also want replace_methods: false (see the section below).
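For example, a minimal sketch combining both options:

def user_lazy(id)
  BatchLoader.for(id).batch(cache: false, replace_methods: false) do |ids, loader|
    User.where(id: ids).each { |user| loader.call(user.id, user) }
  end
end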
By default, BatchLoader replaces methods on its instance by calling #define_method after batching to copy methods from the loaded value. This consumes some time up front but speeds up any future method calls on the instance. In some cases, when there are a lot of instances with a huge number of defined methods, this initial method replacement can be slow. You may consider avoiding the up-front payment and paying as you go with #method_missing by disabling the method replacement:
BatchLoader.for(id).batch(replace_methods: false) do |ids, loader|
  # ...
end
Add this line to your application's Gemfile:
gem 'batch-loader'
And then execute:
$ bundle
Or install it yourself as:
$ gem install batch-loader
BatchLoader.for(item).batch(
  default_value: default_value,
  cache: cache,
  replace_methods: replace_methods,
  key: key
) do |items, loader, args|
  # ...
end
| Argument Key | Default | Description |
| --- | --- | --- |
| item | - | Item which will be collected and used for batching. |
| default_value | nil | Value returned by default after batching. |
| cache | true | Set false to disable caching between the same executions. |
| replace_methods | true | Set false to use #method_missing instead of replacing the methods after batching. |
| key | nil | Pass a custom key to uniquely identify the batch block. |
| items | - | List of collected items for batching. |
| loader | - | Lambda which should be called to map an item to its loaded value. |
| args | {default_value: nil, cache: true, replace_methods: true, key: nil} | Arguments passed to the batch method. |
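Putting the arguments together, a hedged sketch (the Rating model and its value column are hypothetical):

lazy_rating = BatchLoader.for(post.id).batch(
  default_value: 0,
  cache: true,
  replace_methods: false,
  key: :rating
) do |post_ids, loader, args|
  # args == {default_value: 0, cache: true, replace_methods: false, key: :rating}
  Rating.where(post_id: post_ids).each { |rating| loader.call(rating.post_id, rating.value) }
end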
Some other gems are built by using BatchLoader, and there are also implementations of BatchLoader in other programming languages.

For the implementation details, see the slides [37-42].
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/exAspArk/batch-loader. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
There are some other Ruby implementations for batching. However, batch-loader has some differences:

* Batching is defined explicitly, in place, with the batch method.
* It doesn't require returning nil values for the missing items. Instead, it provides the loader lambda, which simply maps an item to the loaded object.

The gem is available as open source under the terms of the MIT License.
Everyone interacting in the Batch::Loader project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.