Question:

How do I bulk-insert JSON files into Azure Cosmos DB (documentDB module)?

朱岳

2023-03-14

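I am using Python with the documentDB module to update a large number of data files with new observations. I have to upsert 100-200 JSON files per minute, and the upload takes more time than the rest of the program. Right now I am using the "UpsertDocument" function of the module's DocumentClient. Is there a faster/better way?

2 answers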

宰父阳焱

2023-03-14

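One option is to use the Cosmos DB Spark connector, optionally (and conveniently) running it as a job in Azure Databricks. This gives you a great deal of control over throughput and makes it easy to find the best balance between parallelism (which I think is the problem here) and the RU capacity on Cosmos DB.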

Here is a simple example of measurements taken while loading 118K documents, using a minimum-spec cluster with only 1 worker.

Single Cosmos client in Python: 28 docs/sec @ 236 RUs (i.e. not pushing Cosmos at all)

Spark Cosmos DB adapter: 66 docs/sec @

…after upgrading Cosmos DB to 10K RUs, Spark Cosmos DB adapter: 1317 docs/sec @

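You could also try Python multithreading (I think it would help), and as CYMA said in the comments, you should check for throttling in Cosmos DB. My observation, though, is that a single Cosmos client will not even get you to the minimum of 400 RUs.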
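
The multithreading suggestion could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a measured solution: `upsert_all` and its `upsert` parameter are hypothetical names, `upsert` stands in for whatever call writes one document (for example a small wrapper around the module's UpsertDocument), and the default of 8 workers is an arbitrary starting point to tune against your RU capacity.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upsert_all(documents, upsert, max_workers=8):
    """Call `upsert(doc)` for every document on a thread pool.

    `upsert` is a caller-supplied callable (hypothetical here); it is
    expected to raise an exception when a write fails.

    Returns (succeeded, failures), where `failures` is a list of
    (document, exception) pairs that the caller may retry.
    """
    succeeded = 0
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its document so failures can be reported.
        futures = {pool.submit(upsert, doc): doc for doc in documents}
        for future in as_completed(futures):
            try:
                future.result()
                succeeded += 1
            except Exception as exc:  # collect the failure instead of aborting the batch
                failures.append((futures[future], exc))
    return succeeded, failures
```

Usage might look like upsert_all(docs, lambda d: client.UpsertDocument(collection_link, d)), where client and collection_link come from your existing setup. Failed documents are returned rather than raised so that one bad write does not stop the rest of the batch.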

雷曜灿

2023-03-14

You can use a stored procedure for bulk insert operations:

function bulkimport2(docObject) {
var collection = getContext().getCollection();
var collectionLink = collection.getSelfLink();

// The count of imported docs, also used as current doc index.
var count = 0;

// Validate input: nothing to do when no documents were passed in.
if (!docObject || !docObject.items || !docObject.items.length) {
    getContext().getResponse().setBody(0);
    return;
}

// docObject.items is expected to be an array of documents.
var docs = docObject.items;
// Number of documents to write (not the length of a serialized string).
var docsLength = docs.length;

// Call the CRUD API to create a document.
tryCreate(docs[count], callback, collectionLink,count);

// Note that there are 2 exit conditions:
// 1) The upsertDocument request was not accepted.
//    In this case the callback will not be called, we just call setBody and we are done.
// 2) The callback was called docs.length times.
//    In this case all documents were written and we don't need to call tryCreate anymore. Just call setBody and we are done.
function tryCreate(doc, callback, collectionLink, count) {
    if (typeof doc == "undefined") {
        getContext().getResponse().setBody(count);
        return;
    }

    // Strip carriage returns and line feeds from every string value.
    doc = JSON.parse(JSON.stringify(doc, function (key, value) {
        return typeof value == "string" ? value.replace(/[\r\n]/g, "") : value;
    }));

    var isAccepted = collection.upsertDocument(collectionLink, doc, callback);

    // If the request was accepted, callback will be called.
    // Otherwise report current count back to the client,
    // which will call the script again with remaining set of docs.
    // This condition will happen when this stored procedure has been running too long
    // and is about to get cancelled by the server. This will allow the calling client
    // to resume this batch from the point we got to before isAccepted was set to false
    if (!isAccepted) {
        getContext().getResponse().setBody(count);
    }
}

// This is called when collection.upsertDocument is done and the document has been persisted.
function callback(err, doc, options) {
    // Rethrow the error so the stored procedure fails and the client sees it.
    if (err) throw err;

    // One more document has been written, increment the count.
    count++;

    if (count >= docsLength) {
        // If we have written all documents, we are done. Just set the response.
        getContext().getResponse().setBody(count);
        return;
    } else {
        // Write the next document.
        tryCreate(docs[count], callback, collectionLink, count);
    }
}
}

You can then load it from Python and execute it. Note that the stored procedure requires a partition key.

Hope it helps.
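
A client-side driver for such a stored procedure might look like the sketch below. It relies on two points from the answer: the procedure needs a partition key, and it reports back how many documents it wrote so the client can resend the rest. The names `group_by_partition_key`, `run_bulk_import`, `execute_sproc`, `partition_key_field` and `batch_size` are hypothetical. `execute_sproc(partition_key, payload)` stands in for the SDK call that runs the stored procedure (in the documentDB module, presumably ExecuteStoredProcedure with a partitionKey option) and is assumed to return the count from the response body.

```python
from collections import defaultdict


def group_by_partition_key(documents, partition_key_field):
    """Group documents by the value of their partition key field."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[partition_key_field]].append(doc)
    return dict(groups)


def run_bulk_import(documents, execute_sproc, partition_key_field, batch_size=100):
    """Send documents to the stored procedure one partition key at a time.

    `execute_sproc(partition_key, payload)` is a caller-supplied callable
    (an assumption of this sketch) that returns how many documents the
    stored procedure wrote before it finished or stopped accepting work.
    """
    total = 0
    groups = group_by_partition_key(documents, partition_key_field)
    for key, docs in groups.items():
        start = 0
        while start < len(docs):
            batch = docs[start:start + batch_size]
            # The stored procedure reads the documents from `items`.
            done = int(execute_sproc(key, {"items": batch}))
            if done <= 0:
                raise RuntimeError("stored procedure made no progress for key %r" % (key,))
            # Resume after the last document the procedure reported as written.
            start += done
            total += done
    return total
```

Grouping first keeps every call within a single partition key, and resuming from the returned count means a batch that is cut short is continued rather than restarted.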
