
dataloader

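License: MIT License
Language: JavaScript
Category: Big Data, Data Query
Software type: Open source
Region: Unknown
Submitted by: 吴兴国
Operating system: Cross-platform
Open-source organization:
Intended audience: Unknown
Software overview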

DataLoader

DataLoader is a generic utility to be used as part of your application's data fetching layer to provide a simplified and consistent API over various remote data sources such as databases or web services via batching and caching.

DataLoader is a port of the "Loader" API originally developed by @schrockn at Facebook in 2010 as a simplifying force to coalesce the sundry key-value store back-end APIs which existed at the time. At Facebook, "Loader" became one of the implementation details of the "Ent" framework, a privacy-aware data entity loading and caching layer within web server product code. This ultimately became the underpinning for Facebook's GraphQL server implementation and type definitions.

DataLoader is a simplified version of this original idea implemented in JavaScript for Node.js services. DataLoader is often used when implementing a graphql-js service, though it is also broadly useful in other situations.

This mechanism of batching and caching data requests is certainly not unique to Node.js or JavaScript; it is also the primary motivation for Haxl, Facebook's data loading library for Haskell. More about how Haxl works can be read in this blog post.

DataLoader is provided so that it may be useful not just to build GraphQL services for Node.js but also as a publicly available reference implementation of this concept in the hopes that it can be ported to other languages. If you port DataLoader to another language, please open an issue to include a link from this repository.

Getting Started

First, install DataLoader using npm.

npm install --save dataloader

To get started, create a DataLoader. Each DataLoader instance represents a unique cache. Typically instances are created per request when used within a web server like express if different users can see different things.

Note: DataLoader assumes a JavaScript environment with global ES6 Promise and Map classes, available in all supported versions of Node.js.

Batching

Batching is not an advanced feature; it is DataLoader's primary feature. Create loaders by providing a batch loading function.

const DataLoader = require('dataloader')

const userLoader = new DataLoader(keys => myBatchGetUsers(keys))

A batch loading function accepts an Array of keys, and returns a Promise which resolves to an Array of values.

Then load individual values from the loader. DataLoader will coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call your batch function with all requested keys.

const user = await userLoader.load(1)
const invitedBy = await userLoader.load(user.invitedByID)
console.log(`User 1 was invited by ${invitedBy}`)

// Elsewhere in your application
const user = await userLoader.load(2)
const lastInvited = await userLoader.load(user.lastInvitedID)
console.log(`User 2 last invited ${lastInvited}`)

A naive application may have issued four round-trips to a backend for the required information, but with DataLoader this application will make at most two.

DataLoader allows you to decouple unrelated parts of your application without sacrificing the performance of batch data-loading. While the loader presents an API that loads individual values, all concurrent requests will be coalesced and presented to your batch loading function. This allows your application to safely distribute data fetching requirements throughout your application and maintain minimal outgoing data requests.
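
To make the coalescing idea concrete, here is a minimal sketch of how loads issued in the same tick can be gathered into one batch. It is a simplified stand-in, not DataLoader's actual source: the names createTinyLoader and scheduleFn are made up for illustration, and the sketch leaves out caching and every other option.

```javascript
// Simplified stand-in for illustration only; not the dataloader source code.
function createTinyLoader(batchFn, scheduleFn = callback => queueMicrotask(callback)) {
  let queue = [] // pending { key, resolve, reject } entries for the next batch

  function dispatch() {
    const batch = queue
    queue = []
    const keys = batch.map(entry => entry.key)
    // One call to the batch function covers every load queued so far.
    new Promise(resolve => resolve(batchFn(keys))).then(
      values => batch.forEach((entry, i) => {
        const value = values[i]
        if (value instanceof Error) entry.reject(value)
        else entry.resolve(value)
      }),
      error => batch.forEach(entry => entry.reject(error))
    )
  }

  return {
    load(key) {
      return new Promise((resolve, reject) => {
        // The first load of a batch schedules the dispatch; later loads only join the queue.
        if (queue.length === 0) scheduleFn(dispatch)
        queue.push({ key, resolve, reject })
      })
    }
  }
}

// Both loads are issued in the same tick, so the batch function runs once with both keys.
const demoLoader = createTinyLoader(async keys => {
  console.log('batch function called with', keys)
  return keys.map(key => `value for ${key}`)
})
demoLoader.load(1).then(value => console.log(value))
demoLoader.load(2).then(value => console.log(value))
```

Scheduling the dispatch on the microtask queue is what lets unrelated call sites share a batch without knowing about each other.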

Batch Function

A batch loading function accepts an Array of keys, and returns a Promise which resolves to an Array of values or Error instances. The loader itself is provided as the this context.

async function batchFunction(keys) {
  const results = await db.fetchAllKeys(keys)
  return keys.map(key => results[key] || new Error(`No result for ${key}`))
}

const loader = new DataLoader(batchFunction)

There are a few constraints this function must uphold:

  • The Array of values must be the same length as the Array of keys.
  • Each index in the Array of values must correspond to the same index in the Array of keys.

For example, if your batch function was provided the Array of keys: [ 2, 9, 6, 1 ], and loading from a back-end service returned the values:

{ id: 9, name: 'Chicago' }
{ id: 1, name: 'New York' }
{ id: 2, name: 'San Francisco' }

Our back-end service returned results in a different order than we requested, likely because it was more efficient for it to do so. Also, it omitted a result for key 6, which we can interpret as no value existing for that key.

To uphold the constraints of the batch function, it must return an Array of values the same length as the Array of keys, and re-order them to ensure each index aligns with the original keys [ 2, 9, 6, 1 ]:

[
  { id: 2, name: 'San Francisco' },
  { id: 9, name: 'Chicago' },
  null, // or perhaps `new Error()`
  { id: 1, name: 'New York' }
]
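
The re-ordering step above could be written as a small helper. This is a sketch under the assumption that each returned row carries its key in an id property; alignResultsToKeys and getKey are hypothetical names, not part of DataLoader.

```javascript
// Hypothetical helper: reorder back-end rows so they line up with the requested keys.
function alignResultsToKeys(keys, rows, getKey = row => row.id) {
  const rowsByKey = new Map(rows.map(row => [getKey(row), row]))
  // Missing keys become null here; returning `new Error(...)` instead would
  // make the corresponding load() reject.
  return keys.map(key => (rowsByKey.has(key) ? rowsByKey.get(key) : null))
}

const rows = [
  { id: 9, name: 'Chicago' },
  { id: 1, name: 'New York' },
  { id: 2, name: 'San Francisco' }
]
const aligned = alignResultsToKeys([2, 9, 6, 1], rows)
console.log(aligned.map(row => (row ? row.name : null)))
// [ 'San Francisco', 'Chicago', null, 'New York' ]
```

Building a Map once keeps the lookup linear in the number of keys, which matters for large batches.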

Batch Scheduling

By default DataLoader will coalesce all individual loads which occur within a single frame of execution before calling your batch function with all requested keys. This ensures no additional latency while capturing many related requests into a single batch. In fact, this is the same behavior used in Facebook's original PHP implementation in 2010. See enqueuePostPromiseJob in the source code for more details about how this works.

However sometimes this behavior is not desirable or optimal. Perhaps you expect requests to be spread out over a few subsequent ticks because of an existing use of setTimeout, or you just want manual control over dispatching regardless of the run loop. DataLoader allows providing a custom batch scheduler to provide these or any other behaviors.

A custom scheduler is provided as batchScheduleFn in options. It must be a function which is passed a callback and is expected to call that callback in the immediate future to execute the batch request.

As an example, here is a batch scheduler which collects all requests over a 100ms window of time (and as a consequence, adds 100ms of latency):

const myLoader = new DataLoader(myBatchFn, {
  batchScheduleFn: callback => setTimeout(callback, 100)
})

As another example, here is a manually dispatched batch scheduler:

function createScheduler() {
  let callbacks = []
  return {
    schedule(callback) {
      callbacks.push(callback)
    },
    dispatch() {
      callbacks.forEach(callback => callback())
      callbacks = []
    }
  }
}

const { schedule, dispatch } = createScheduler()
const myLoader = new DataLoader(myBatchFn, { batchScheduleFn: schedule })

myLoader.load(1)
myLoader.load(2)
dispatch()

Caching

DataLoader provides a memoization cache for all loads which occur in a single request to your application. After .load() is called once with a given key, the resulting value is cached to eliminate redundant loads.

Caching Per-Request

DataLoader caching does not replace Redis, Memcache, or any other shared application-level cache. DataLoader is first and foremost a data loading mechanism, and its cache only serves the purpose of not repeatedly loading the same data in the context of a single request to your application. To do this, it maintains a simple in-memory memoization cache (more accurately: .load() is a memoized function).
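
As a rough illustration of what "a memoized function" means here, the sketch below wraps a single-key fetch so that repeated calls with the same key reuse the first Promise. The names memoizeLoad and fetchOne are hypothetical, and this is not how DataLoader is implemented internally.

```javascript
// Illustrative only: wrap a hypothetical single-key fetch function so that
// repeated calls with the same key reuse the first Promise.
function memoizeLoad(fetchOne) {
  const cache = new Map()
  return key => {
    if (!cache.has(key)) {
      cache.set(key, Promise.resolve(fetchOne(key)))
    }
    return cache.get(key)
  }
}

let fetchCount = 0
const loadUser = memoizeLoad(id => {
  fetchCount += 1
  return { id }
})

const first = loadUser(1)
const second = loadUser(1)
console.log(first === second, fetchCount) // true 1
```

Because the cache lives inside the closure, discarding the wrapped function at the end of a request discards the cache with it.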

Avoid sharing one DataLoader instance across requests from different users, which could result in cached data incorrectly appearing in each request. Typically, DataLoader instances are created when a request begins, and are not used once the request ends.

For example, when using with express:

const express = require('express')
const DataLoader = require('dataloader')

function createLoaders(authToken) {
  return {
    users: new DataLoader(ids => genUsers(authToken, ids)),
  }
}

const app = express()

app.get('/', function(req, res) {
  const authToken = authenticateUser(req)
  const loaders = createLoaders(authToken)
  res.send(renderPage(req, loaders))
})

app.listen()

Caching and Batching

Subsequent calls to .load() with the same key will result in that key not appearing in the keys provided to your batch function. However, the resulting Promise will still wait on the current batch to complete. This way both cached and uncached requests will resolve at the same time, allowing DataLoader optimizations for subsequent dependent loads.

In the example below, User 1 happens to be cached. However, because User 1 and 2 are loaded in the same tick, they will resolve at the same time. This means both user.bestFriendID loads will also happen in the same tick, which results in two total requests (the same as if User 1 had not been cached).

userLoader.prime(1, { bestFriendID: 3 })

async function getBestFriend(userID) {
  const user = await userLoader.load(userID)
  return await userLoader.load(user.bestFriendID)
}

// In one part of your application
getBestFriend(1)

// Elsewhere
getBestFriend(2)

Without this optimization, if the cached User 1 resolved immediately, this could result in three total requests since each user.bestFriendID load would happen at different times.

Clearing Cache

In certain uncommon cases, clearing the request cache may be necessary.

The most common case where clearing the loader's cache is necessary is after a mutation or update within the same request, when a cached value could be out of date and future loads should not use any possibly cached value.

Here's a simple example using SQL UPDATE to illustrate.

// Request begins...
const userLoader = new DataLoader(...)

// And a value happens to be loaded (and cached).
const user = await userLoader.load(4)

// A mutation occurs, invalidating what might be in cache.
await sqlRun('UPDATE users SET username="zuck" WHERE id=4')
userLoader.clear(4)

// Later the value is loaded again so the mutated data appears.
const updatedUser = await userLoader.load(4)

// Request completes.

Caching Errors

If a batch load fails (that is, a batch function throws or returns a rejected Promise), then the requested values will not be cached. However if a batch function returns an Error instance for an individual value, that Error will be cached to avoid frequently loading the same Error.

In some circumstances you may wish to clear the cache for these individual Errors:

try {
  const user = await userLoader.load(1)
} catch (error) {
  if (/* determine if the error should not be cached */) {
    userLoader.clear(1)
  }
  throw error
}

Disabling Cache

In certain uncommon cases, a DataLoader which does not cache may be desirable. Calling new DataLoader(myBatchFn, { cache: false }) will ensure that every call to .load() will produce a new Promise, and requested keys will not be saved in memory.

However, when the memoization cache is disabled, your batch function will receive an array of keys which may contain duplicates! Each key will be associated with each call to .load(). Your batch loader should provide a value for each instance of the requested key.

For example:

const myLoader = new DataLoader(keys => {
  console.log(keys)
  return someBatchLoadFn(keys)
}, { cache: false })

myLoader.load('A')
myLoader.load('B')
myLoader.load('A')

// > [ 'A', 'B', 'A' ]
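
One way to return a value for every instance of a duplicated key, sketched here under the assumption that the back end prefers to receive unique keys, is to wrap the batch function. withUniqueKeys is a hypothetical helper, not part of DataLoader.

```javascript
// Hypothetical wrapper: call the underlying batch function with unique keys,
// then expand the result back to one value per requested key.
function withUniqueKeys(batchFn) {
  return async keys => {
    const uniqueKeys = [...new Set(keys)]
    const values = await batchFn(uniqueKeys)
    const valueByKey = new Map(uniqueKeys.map((key, i) => [key, values[i]]))
    // Every requested key, duplicates included, gets a value at its own index.
    return keys.map(key => valueByKey.get(key))
  }
}

const seenKeys = []
const batchFn = withUniqueKeys(async keys => {
  seenKeys.push(keys)
  return keys.map(key => key.toLowerCase())
})

batchFn(['A', 'B', 'A']).then(values => console.log(values))
// [ 'a', 'b', 'a' ]
```

Such a wrapper could then be passed as new DataLoader(withUniqueKeys(someBatchLoadFn), { cache: false }). Note that Set compares object keys by identity, so this sketch only deduplicates primitive keys.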

More complex cache behavior can be achieved by calling .clear() or .clearAll() rather than disabling the cache completely. For example, this DataLoader will provide unique keys to a batch function due to the memoization cache being enabled, but will immediately clear its cache when the batch function is called so later requests will load new values.

const myLoader = new DataLoader(keys => {
  myLoader.clearAll()
  return someBatchLoadFn(keys)
})

Custom Cache

As mentioned above, DataLoader is intended to be used as a per-request cache. Since requests are short-lived, DataLoader uses an infinitely growing Map as a memoization cache. This should not pose a problem as most requests are short-lived and the entire cache can be discarded after the request completes.

However this memoization caching strategy isn't safe when using a long-lived DataLoader, since it could consume too much memory. If using DataLoader in this way, you can provide a custom Cache instance with whatever behavior you prefer, as long as it follows the same API as Map.

The example below uses an LRU (least recently used) cache, via the lru_map npm package, to hold at most 100 cached values.

import { LRUMap } from 'lru_map'

const myLoader = new DataLoader(someBatchLoadFn, {
  cacheMap: new LRUMap(100)
})

More specifically, any object that implements the get(), set(), delete() and clear() methods can be provided. This allows custom Maps which implement various cache algorithms to be used.
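
As a sketch of such an object, the class below implements those four methods and evicts the oldest inserted entry once it is full. BoundedCacheMap is a hypothetical name, and the eviction policy is first-in-first-out rather than true LRU, chosen only to keep the example short.

```javascript
// Hypothetical Map-like cache that evicts the oldest entry once it is full.
// It relies on Map preserving insertion order.
class BoundedCacheMap {
  constructor(maxSize) {
    this.maxSize = maxSize
    this.map = new Map()
  }
  get(key) {
    return this.map.get(key)
  }
  set(key, value) {
    if (!this.map.has(key) && this.map.size >= this.maxSize) {
      // Evict the oldest inserted key (first in iteration order).
      const oldestKey = this.map.keys().next().value
      this.map.delete(oldestKey)
    }
    this.map.set(key, value)
    return this
  }
  delete(key) {
    return this.map.delete(key)
  }
  clear() {
    this.map.clear()
  }
}

const cache = new BoundedCacheMap(2)
cache.set('a', 1).set('b', 2).set('c', 3)
console.log(cache.get('a'), cache.get('b'), cache.get('c')) // undefined 2 3
```

An instance could be supplied through the cacheMap option in the same way as the LRUMap above.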

API

class DataLoader

DataLoader creates a public API for loading data from a particular data back-end with unique keys such as the id column of a SQL table or document name in a MongoDB database, given a batch loading function.

Each DataLoader instance contains a unique memoized cache. Use caution when used in long-lived applications or those which serve many users with different access permissions and consider creating a new instance per web request.

new DataLoader(batchLoadFn [, options])

Create a new DataLoader given a batch loading function and options.

  • batchLoadFn: A function which accepts an Array of keys, and returns a Promise which resolves to an Array of values.

  • options: An optional object of options:

    | Option Key | Type | Default | Description |
    | --- | --- | --- | --- |
    | batch | Boolean | true | Set to false to disable batching, invoking batchLoadFn with a single load key. This is equivalent to setting maxBatchSize to 1. |
    | maxBatchSize | Number | Infinity | Limits the number of items that get passed in to the batchLoadFn. May be set to 1 to disable batching. |
    | batchScheduleFn | Function | See Batch Scheduling | A function to schedule the later execution of a batch. The function is expected to call the provided callback in the immediate future. |
    | cache | Boolean | true | Set to false to disable memoization caching, creating a new Promise and new key in the batchLoadFn for every load of the same key. This is equivalent to setting cacheMap to null. |
    | cacheKeyFn | Function | key => key | Produces cache key for a given load key. Useful when objects are keys and two objects should be considered equivalent. |
    | cacheMap | Object | new Map() | Instance of Map (or an object with a similar API) to be used as cache. May be set to null to disable caching. |
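
To illustrate the cacheKeyFn option, here is a sketch of a key function for flat object keys. It serializes the object with its property names sorted so that two objects with the same contents produce the same cache key. stableKey is a hypothetical name, and the sketch assumes keys are plain, non-nested objects with JSON-serializable values.

```javascript
// Hypothetical cacheKeyFn: serialize a flat object key with its property
// names sorted, so { a: 1, b: 2 } and { b: 2, a: 1 } map to the same cache key.
function stableKey(key) {
  const sortedEntries = Object.keys(key).sort().map(name => [name, key[name]])
  return JSON.stringify(sortedEntries)
}

console.log(stableKey({ id: 4, type: 'user' }) === stableKey({ type: 'user', id: 4 }))
// true
```

It would be supplied as new DataLoader(batchLoadFn, { cacheKeyFn: stableKey }).
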

load(key)

Loads a key, returning a Promise for the value represented by that key.

  • key: A key value to load.

loadMany(keys)

Loads multiple keys, promising an array of values:

const [ a, b ] = await myLoader.loadMany([ 'a', 'b' ])

This is similar to the more verbose:

const [ a, b ] = await Promise.all([
  myLoader.load('a'),
  myLoader.load('b')
])

However it is different in the case where any load fails. Where Promise.all() would reject, loadMany() always resolves; each result is either a value or an Error instance.

const [ a, b, c ] = await myLoader.loadMany([ 'a', 'b', 'badkey' ])
// c instanceof Error
  • keys: An array of key values to load.
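
Since loadMany() resolves with a mix of values and Error instances, callers usually need to separate the two. The helper below is one possible sketch; partitionResults is a hypothetical name, and the results array is a stand-in for what loadMany() might resolve to.

```javascript
// Hypothetical helper: split a loadMany()-style result array into values and errors.
function partitionResults(results) {
  const values = []
  const errors = []
  for (const result of results) {
    if (result instanceof Error) errors.push(result)
    else values.push(result)
  }
  return { values, errors }
}

// Stand-in for what `await myLoader.loadMany([...])` might resolve to.
const results = ['value a', 'value b', new Error('No result for badkey')]
const { values, errors } = partitionResults(results)
console.log(values.length, errors.length) // 2 1
```
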

clear(key)

Clears the value at key from the cache, if it exists. Returns itself for method chaining.

  • key: A key value to clear.

clearAll()

Clears the entire cache. To be used when some event results in unknown invalidations across this particular DataLoader. Returns itself for method chaining.

prime(key, value)

Primes the cache with the provided key and value. If the key already exists, no change is made. (To forcefully prime the cache, clear the key first with loader.clear(key).prime(key, value).) Returns itself for method chaining.

To prime the cache with an error at a key, provide an Error instance.

Using with GraphQL

DataLoader pairs nicely with GraphQL. GraphQL fields are designed to be stand-alone functions. Without a caching or batching mechanism, it's easy for a naive GraphQL server to issue new database requests each time a field is resolved.

Consider the following GraphQL request:

{
  me {
    name
    bestFriend {
      name
    }
    friends(first: 5) {
      name
      bestFriend {
        name
      }
    }
  }
}

Naively, if me, bestFriend and friends each need to request the backend, there could be at most 13 database requests!

When using DataLoader, we could define the User type using the SQLite example with clearer code and at most 4 database requests, and possibly fewer if there are cache hits.

const { GraphQLObjectType, GraphQLString, GraphQLInt, GraphQLList } = require('graphql')

const UserType = new GraphQLObjectType({
  name: 'User',
  fields: () => ({
    name: { type: GraphQLString },
    bestFriend: {
      type: UserType,
      resolve: user => userLoader.load(user.bestFriendID)
    },
    friends: {
      args: {
        first: { type: GraphQLInt }
      },
      type: new GraphQLList(UserType),
      resolve: async (user, { first }) => {
        const rows = await queryLoader.load([
          'SELECT toID FROM friends WHERE fromID=? LIMIT ?', user.id, first
        ])
        return rows.map(row => userLoader.load(row.toID))
      }
    }
  })
})

Common Patterns

Creating a new DataLoader per request.

In many applications, a web server using DataLoader serves requests to many different users with different access permissions. It may be dangerous to use one cache across many users, so you are encouraged to create a new DataLoader per request:

function createLoaders(authToken) {
  return {
    users: new DataLoader(ids => genUsers(authToken, ids)),
    cdnUrls: new DataLoader(rawUrls => genCdnUrls(authToken, rawUrls)),
    stories: new DataLoader(keys => genStories(authToken, keys)),
  }
}

// When handling an incoming web request:
const loaders = createLoaders(request.query.authToken)

// Then, within application logic:
const user = await loaders.users.load(4)
const pic = await loaders.cdnUrls.load(user.rawPicUrl)

Creating an object where each key is a DataLoader is one common pattern which provides a single value to pass around to code which needs to perform data loading, such as part of the rootValue in a graphql-js request.

Loading by alternative keys.

Occasionally, some kind of value can be accessed in multiple ways. For example, perhaps a "User" type can be loaded not only by an "id" but also by a "username" value. If the same user is loaded by both keys, then it may be useful to fill both caches when a user is loaded from either source:

const userByIDLoader = new DataLoader(async ids => {
  const users = await genUsersByID(ids)
  for (let user of users) {
    usernameLoader.prime(user.username, user)
  }
  return users
})

const usernameLoader = new DataLoader(async names => {
  const users = await genUsernames(names)
  for (let user of users) {
    userByIDLoader.prime(user.id, user)
  }
  return users
})

Freezing results to enforce immutability

Since DataLoader caches values, it's typically assumed these values will be treated as if they were immutable. While DataLoader itself doesn't enforce this, you can create a higher-order function to enforce immutability with Object.freeze():

function freezeResults(batchLoader) {
  return keys => batchLoader(keys).then(values => values.map(Object.freeze))
}

const myLoader = new DataLoader(freezeResults(myBatchLoader))
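
Object.freeze() is shallow, so nested objects inside a frozen result can still be mutated. If deeper protection is wanted, a recursive variant could look like the sketch below; deepFreeze and deepFreezeResults are hypothetical helpers, not part of DataLoader.

```javascript
// Hypothetical helper: recursively freeze objects and arrays.
function deepFreeze(value) {
  if (value === null || typeof value !== 'object' || Object.isFrozen(value)) {
    return value
  }
  // Freeze first so that cyclic references terminate on the isFrozen check.
  Object.freeze(value)
  for (const name of Object.getOwnPropertyNames(value)) {
    deepFreeze(value[name])
  }
  return value
}

// Same shape as freezeResults, but freezing nested values as well.
function deepFreezeResults(batchLoader) {
  return keys => batchLoader(keys).then(values => values.map(value => deepFreeze(value)))
}

const user = deepFreeze({ id: 1, address: { city: 'Chicago' } })
console.log(Object.isFrozen(user.address)) // true
```

The extra traversal costs time proportional to the size of each result, so the shallow version may be preferable for large values.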

Batch functions which return Objects instead of Arrays

DataLoader expects batch functions which return an Array of the same length as the provided keys. However this is not always a common return format from other libraries. A DataLoader higher-order function can convert from one format to another. The example below converts a { key: value } result to the format DataLoader expects.

function objResults(batchLoader) {
  return keys => batchLoader(keys).then(objValues => keys.map(
    key => objValues[key] || new Error(`No value for ${key}`)
  ))
}

const myLoader = new DataLoader(objResults(myBatchLoader))
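
Note that objValues[key] || new Error(...) treats legitimate falsy values such as 0, '' or false as missing. If such values can occur, the lookup could test for the presence of the key instead, as in this sketch; objectToArray is a hypothetical helper name.

```javascript
// Hypothetical variant of the key lookup that keeps falsy values such as 0 or ''.
function objectToArray(keys, objValues) {
  return keys.map(key =>
    Object.prototype.hasOwnProperty.call(objValues, key)
      ? objValues[key]
      : new Error(`No value for ${key}`)
  )
}

const counts = objectToArray(['a', 'b', 'c'], { a: 0, b: 7 })
console.log(counts[0], counts[1], counts[2] instanceof Error) // 0 7 true
```
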

Common Back-ends

Looking to get started with a specific back-end? Try the loaders in the examples directory.

Other Implementations

Listed in alphabetical order

Video Source Code Walkthrough

DataLoader Source Code Walkthrough (YouTube):

A walkthrough of the DataLoader v1 source code. While the source has changed since this video was made, it is still a good overview of the rationale of DataLoader and how it works.

Contributing to this repo

This repository is managed by EasyCLA. Project participants must sign the free GraphQL Specification Membership agreement before making a contribution. You only need to do this one time, and it can be signed by individual contributors or their employers.

To initiate the signature process please open a PR against this repo. The EasyCLA bot will block the merge if we still need a membership agreement from you.

You can find detailed information here. If you have issues, please email operations@graphql.org.

If your company benefits from GraphQL and you would like to provide essential financial support for the systems and people that power our community, please also consider membership in the GraphQL Foundation.
