当前位置: 首页 > 知识库问答 >
问题:

如何使用Apache Beam(Python)将多个嵌套JSON写入BigQuery表

苏乐童
2023-03-14

我正在使用Python从Dataflow向BigQuery表编写一个复杂的JSON对象集合。手动创建如下表模式太复杂了,因为我的json对象嵌套了多个层。

from apache_beam.io.gcp.internal.clients import bigquery

table_schema=bigquery.TableSchema()

id_schema = bigquery.TableFieldSchema()
id_schema.name = 'ID'
id_schema.type = 'integer'
id_schema.mode = 'nullable'
table_schema.fields.append(id_schema)
...

所以我尝试了从Dataflow(Python)将嵌套模式写入BigQuery推荐的方法。首先,我在云控制台中运行以下命令来获取模式

bq --format=json show project:dataset.table > output_schema.json

然后我运行下面的代码来获取表模式,

table_schema = parse_table_schema_from_json(json.dumps(json.load(open("output_schema.json"))["schema"]))

这完全符合预期。该表最初是从Jupyter笔记本创建的,我可以在其中使用bigquery。具有自动检测功能的 LoadJobConfig 可在不提供架构的情况下写入 BigQuery。

现在我使用 Apache Beam 管道尝试使用此模式写入 BigQuery,不知何故我遇到了一些错误,例如:

WARNING:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery. Will retry. Errors were [<InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.documents'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 0>, <InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.errors'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 1>, <InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.documents'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 2>]

我的表模式是:

table_schema = {
  "fields": [
    {"name": "ID", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "SourceResourceID","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentText","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentName","type": "STRING","mode": "NULLABLE"},
    {"name": "EncounterNumber","type": "FLOAT","mode": "NULLABLE"},
    {"name": "EncounterResourceID","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentId","type": "STRING","mode": "NULLABLE"},
    {"name": "DocumentDate","type": "TIMESTAMP","mode": "NULLABLE"},
    {"name": "SectionTitle","type": "STRING","mode": "NULLABLE"},
    {"name": "SectionHeader","type": "STRING","mode": "NULLABLE"},
    {"name": "SectionText","type": "STRING","mode": "NULLABLE"},
    {"name": "SectionTokens","type": "RECORD","mode": "NULLABLE",
      "fields": [
        {"name": "documents","type": "RECORD","mode": "NULLABLE",
          "fields": [
            {"name": "list","type": "RECORD","mode": "REPEATED",
              "fields": [
                {"name": "item","type": "RECORD","mode": "NULLABLE",
                  "fields": [
                    {"name": "entities","type": "RECORD","mode": "NULLABLE",
                      "fields": [
                        {"name": "list","type": "RECORD","mode": "REPEATED",
                          "fields": [
                            {"name": "item","type": "RECORD","mode": "NULLABLE",
                              "fields": [
                                {"name": "category","type": "STRING","mode": "NULLABLE"},
                                {"name": "confidenceScore","type": "FLOAT","mode": "NULLABLE"},
                                {"name": "id","type": "STRING","mode": "NULLABLE"},
                                {"name": "isNegated","type": "BOOLEAN","mode": "NULLABLE"},
                                {"name": "length","type": "INTEGER","mode": "NULLABLE"},
                                {"name": "links","type": "RECORD","mode": "NULLABLE",
                                  "fields": [
                                    {"name": "list","type": "RECORD","mode": "REPEATED",
                                      "fields": [
                                        {"name": "item","type": "RECORD","mode": "NULLABLE",
                                          "fields": [
                                            {"name": "dataSource","type": "STRING","mode":"NULLABLE"},
                                            {"name": "id","type": "STRING","mode": "NULLABLE"}
                                          ]
                                        }
                                      ]
                                    }
                                  ]
                                },
                                {"name": "offset","type": "INTEGER","mode": "NULLABLE"},
                                {"name": "text","type": "STRING","mode": "NULLABLE"}
                              ]
                            }
                          ]
                        }
                      ]
                    },
                    {"name": "id","type": "STRING","mode": "NULLABLE"},
                    {"name": "relations","type": "RECORD","mode": "NULLABLE",
                      "fields": [
                        {"name": "list","type": "RECORD","mode": "REPEATED",
                          "fields": [
                            {"name": "item","type": "RECORD","mode": "NULLABLE",
                              "fields": [
                                {"name": "bidirectional","type": "BOOLEAN","mode": "NULLABLE"},
                                {"name": "relationType","type": "STRING","mode": "NULLABLE"},
                                {"name": "source","type": "STRING","mode": "NULLABLE"},
                                {"name": "target","type": "STRING","mode": "NULLABLE"}
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        {"name": "errors","type": "RECORD","mode": "NULLABLE",
          "fields": [
            {"name": "list","type": "RECORD","mode": "REPEATED",
              "fields": [
                {"name": "item","type": "RECORD","mode": "NULLABLE",
                  "fields": [
                    {"name": "error","type": "RECORD","mode": "NULLABLE",
                      "fields": [
                        {"name": "code","type": "STRING","mode": "NULLABLE"},
                        {"name": "innererror","type": "RECORD","mode": "NULLABLE",
                          "fields": [
                            {"name": "code","type": "STRING","mode": "NULLABLE"},
                            {"name": "message","type": "STRING","mode": "NULLABLE"}
                          ]
                        },
                        {"name": "message","type": "STRING","mode": "NULLABLE"}
                      ]
                    },
                    {"name": "id","type": "STRING","mode": "NULLABLE"}
                  ]
                }
              ]
            }
          ]
        },
        {"name": "modelVersion","type": "STRING","mode": "NULLABLE"}
      ]
    }
  ]
}

下面是一些示例数据:

{'ID': 123, 'SourceResourceID': 'Resource/3c81b4d2-3ee9-11eb-8bf6-0242ac100303', 'DocumentText': 'EXAM:  CT CHEST IC  \n\n\nPROCEDURE DATE:  12/11/2020  \n', 'DocumentName': 'CT CHEST IC', 'EncounterNumber': None, 'EncounterResourceID': 'Encounter/123', 'DocumentId': '123', 'DocumentDate': '2020-12-15 10:21:00 UTC', 'SectionTitle': 'physical_exam', 'SectionHeader': 'EXAM:', 'SectionText': 'EXAM:  CT CHEST IC  \n\n\nPROCEDURE DATE:  12/11/2020  \n \n\n\n', 'SectionTokens': {'documents': [{'id': '1', 'entities': [{'id': '0', 'offset': 7, 'length': 11, 'text': 'CT CHEST IC', 'category': 'ExaminationName', 'confidenceScore': 0.98, 'isNegated': False}]}], 'errors': [], 'modelVersion': '2020-09-03'}}

有人能帮忙找出我做错了什么吗?谢谢

共有1个答案

麹高义
2023-03-14

在您的架构中,sectiontokens.documentssectiontokens.errors 被指定为 RECORD 类型,这意味着 BigQuery 预计该字段只有一个记录,但在您的数据中,这些键实际上是对象列表。

如果要定义一列以接受对象列表,则需要具有“mode”:“REPEATED”https://cloud.google.com/bigquery/docs/nested-repeated

 类似资料:
  • 我试图使用Apache Beam将一些看起来像Json的数据流作为字符串写入BigQuery。数据是一个数组中的一个数组,由于不可能将嵌套数组写入Bigquery,所以我在主记录地址中创建了一个子记录Address_Instance。我需要帮助将我的嵌套记录写入BigQuery。 下面是我的数据的样子,以及我如何将其转换为Bigquery。 这就是我得到的错误: 下面是我的BigQuery模式的样

  • 我有一个数据流作业要写入BigQuery。它对于非嵌套模式很好,但是对于嵌套模式却失败了。 下面是我的数据流管道: 我使用以下模式创建了BigQuery表: 我得到以下错误: 有人能给我指路吗?我做错了什么?此外,如果有更好的方法迭代所有嵌套模式并写入BigQuery,请建议? 其他信息我的数据文件:

  • 问题内容: 我可以使用Golang将平面对象插入BigQuery中-如何将嵌套数据插入表中? 我的BigQuery模式如下所示(从示例中): 我的第一次插入尝试如下所示(示例): 哪个变平并插入没有问题。我只是在用visit_duration 但是,我需要遍历一个片段并添加到访问记录中。我试图建立一个 访问 对象(没有要测试的循环)并将其添加到行中,但它没有插入,并且我没有收到任何错误: - -[

  • 问题内容: 我正在尝试将嵌套的字典写入.csv文件。这是一个简单的示例: 这使我得到一个包含两列的表:第一个包含; 第二个包含[2,1,1](或子词典中的相应值)。我想要一个有四列的表:一列对应的列表元素,然后三列对应的列表元素。 问题答案: 更改: 至: 否则,您会尝试向csv编写类似的内容,而您的意思是。 如Padraic所述,您可能希望更改为或避免随机行为。

  • 我有多个json文件,其中包含需要合并的关系数据,每个文件都有一个带有commonkey的记录,这是所有文件中的公共键,在下面的示例中,a0,a1是公共键。值是多个键的嵌套字典,如Key1、key2等,如下所示,我需要合并多个json文件并获得如dboutput.json所示的输出,文件名在合并操作中充当索引。这样的问题是一个相关的问题,它合并了丢失的信息,但在我的情况下,我不希望任何更新替换现有

  • 我正在将JSON插入Google Big Query中。问题的底部是JSON的模式。 下面是一个JSON示例: 如您所见,有一个“_source.fullname”字段带有“Mike Smith”。 当我试图用它创建表时,它出错了: 为非重复字段指定的数组:_source.fullname。 我相信这个字段是_source的一次性字段。我如何克服这个错误? 下面是模式:

  • 问题内容: 我正在努力解决这个问题。我有一个JSON文件,需要将其放到CSV中,如果结构是扁平的,没有深层嵌套的项目,就可以了。 但是在这种情况下,嵌套会使我烦恼。 我将如何以这种格式获取数据: 每个对象和对象中的每个种族? 我想这样输出到CSV: 所以首先我得到正确的密钥: 然后添加数据: 但是如何? 对于您熟练的forlooper来说,这应该是一个有趣的过程。 问题答案: 我只收集第一个对象的

  • 问题内容: 我正在实现一个命令行程序,其界面如下: 我已经阅读了argparse文档。我可以使用in实现为可选参数。以及using子命令。 从文档看来,我只能有一个子命令。但是如您所见,我必须实现一个或多个子命令。解析此类命令行参数使用的最佳方法是什么? 问题答案: @mgilson对这个问题有很好的答案。但是我自己拆分sys.argv的问题是我丢失了Argparse为用户生成的所有有用的帮助消息