Question:

Extracting values from multiple XML nodes [duplicate]

邵博远

2023-03-14

I have the following data structure (the original file is 2.5 GB, so it has to be parsed):

<households xmlns="http://www.matsim.org/files/dtd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.matsim.org/files/dtd http://www.matsim.org/files/dtd/households_v1.0.xsd">
    <household id="1473">
        <members>
            <personId refId="2714"/>
            <personId refId="2715"/>
            <personId refId="2716"/>
            <personId refId="2717"/>
            <personId refId="2718"/>
            <personId refId="2719"/>
        </members>
        <income currency="CHF" period="month">
                3094.87101
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >some</attribute>
            <attribute name="carAvailability" class="java.lang.String" >some</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >3.3</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >3094.8710104279835</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >1</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >10213.074334412346</attribute>
        </attributes>

    </household>
    <household id="2474">
        <members>
            <personId refId="4647"/>
            <personId refId="4648"/>
            <personId refId="4649"/>
            <personId refId="4650"/>
            <personId refId="4651"/>
            <personId refId="4652"/>
            <personId refId="4653"/>
            <personId refId="4654"/>
            <personId refId="4655"/>
        </members>
        <income currency="CHF" period="month">
                1602.562822
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >none</attribute>
            <attribute name="carAvailability" class="java.lang.String" >all</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >3.6999999999999997</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >1602.5628215679633</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >1</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >5929.482439801463</attribute>
        </attributes>

    </household>
    <household id="4024">
        <members>
            <personId refId="7685"/>
        </members>
        <income currency="CHF" period="month">
                61610.096619
        </income>
        <attributes>
            <attribute name="bikeAvailability" class="java.lang.String" >none</attribute>
            <attribute name="carAvailability" class="java.lang.String" >none</attribute>
            <attribute name="consumptionUnits" class="java.lang.Double" >1.0</attribute>
            <attribute name="householdIncomePerConsumptionUnit" class="java.lang.Double" >61610.096618936936</attribute>
            <attribute name="numberOfCars" class="java.lang.Integer" >0</attribute>
            <attribute name="residenceZoneCategory" class="java.lang.Integer" >1</attribute>
            <attribute name="totalHouseholdIncome" class="java.lang.Double" >61610.096618936936</attribute>
        </attributes>

    </household>
</households>

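I want to extract all personId refId values together with their corresponding income values. In the end, I plan to have a df with one personId column and one income column (the income values will be repeated). So the tricky part is not only the namespace, but also how to access the XML at different node levels.

So far, my approach has not been able to do this.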

import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd
import numpy as np

tree = ET.parse(gzip.open('V0_1pm/output_households.xml.gz', 'r'))
root = tree.getroot()
rows = []
for it in root.iter('household'):
    hh = it.attrib['id']
    inc = it.find('income').text
    rows.append([hh,inc])

hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
hh_inc

Thank you very much for your help.

1 answer

孟海

2023-03-14

Your code fails because the input elements have a non-empty namespace.
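To make the failure visible, here is a minimal sketch on a shortened inline copy of the sample (the `xml_text` string and the variable names are illustrative, not part of the original code). It shows that ElementTree stores the namespace URI inside each tag name, so a search for the bare name `household` finds nothing:

```python
import xml.etree.ElementTree as ET

# Shortened inline sample with the same default namespace as the question's file.
xml_text = """<households xmlns="http://www.matsim.org/files/dtd">
    <household id="1473">
        <income currency="CHF" period="month">
            3094.87101
        </income>
    </household>
</households>"""

root = ET.fromstring(xml_text)

# ElementTree keeps the namespace URI in braces as part of the tag name.
print(root.tag)  # {http://www.matsim.org/files/dtd}households

# The bare name matches nothing, so the loop body in the question never runs.
unqualified = list(root.iter('household'))

# The fully qualified name does match.
qualified = list(root.iter('{http://www.matsim.org/files/dtd}household'))
print(len(unqualified), len(qualified))  # 0 1
```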

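One way to process namespaced XML is to:

Define a dictionary of "shortcut: namespace" pairs, containing all namespaces used in your XPath expressions.
Call findall or find, passing this dictionary as the second argument, and prefix the names in the XPath expression with the relevant namespace shortcut (with a colon as the separator).

Also note that find(...).text returns the full text, including newlines and spaces. To deal with this, you should probably:

Strip the surrounding whitespace characters from the text that was read.
Convert it to float.

So change your code to: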

# Namespace dictionary
ns = {'dtd': 'http://www.matsim.org/files/dtd'}
rows = []
for it in root.findall('dtd:household', ns):
    hh = it.attrib['id']
    inc = it.find('dtd:income', ns).text
    inc = float(inc.strip())
    rows.append([hh, inc])
hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
hh_inc

For your sample input, I get:

     id  PTSubscription
0  1473     3094.871010
1  2474     1602.562822
2  4024    61610.096619

I assume the DataFrame should contain a separate row for each refId, with the associated id and PTSubscription.

To include refId, change the loop to:

rows = []  # start from an empty list so no two-column rows are left over
for it in root.findall('dtd:household', ns):
    hh = it.attrib['id']
    inc = it.find('dtd:income', ns).text
    inc = float(inc.strip())
    pids = it.findall('.//dtd:personId', ns)
    for pId in pids:
        refId = pId.attrib['refId']
        rows.append([hh, inc, int(refId)])
    if not pids:
        rows.append([hh, inc, -1])

I added the last two lines to avoid "losing" any household that contains no refId.

When creating the DataFrame, pass the additional column name:

hh_inc = pd.DataFrame(rows, columns=['id', 'PTSubscription', 'refId'])
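Since the question mentions a 2.5 GB file, a streaming variant may be worth considering. The following is only a sketch of a different technique from the code above: it uses `ET.iterparse` instead of `ET.parse`, so finished households can be cleared as the file is read. The helper name `iter_rows`, the inline `sample` and the household id `9999` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI from the question, in the brace form ElementTree uses in tags.
NS = '{http://www.matsim.org/files/dtd}'

def iter_rows(source):
    """Yield [household id, income, refId] rows while streaming the XML."""
    for event, elem in ET.iterparse(source, events=('end',)):
        # An 'end' event fires once an element and all its children are parsed.
        if elem.tag != NS + 'household':
            continue
        hh = elem.attrib['id']
        inc = float(elem.find(NS + 'income').text.strip())
        pids = elem.findall('.//' + NS + 'personId')
        for p in pids:
            yield [hh, inc, int(p.attrib['refId'])]
        if not pids:
            # Keep households without members, as in the loop above.
            yield [hh, inc, -1]
        # Drop this household's children and text to keep memory use low.
        elem.clear()

# Hypothetical two-household sample; the second household has no members.
sample = b"""<households xmlns="http://www.matsim.org/files/dtd">
  <household id="1473">
    <members><personId refId="2714"/><personId refId="2715"/></members>
    <income currency="CHF" period="month">
        3094.87101
    </income>
  </household>
  <household id="9999">
    <members/>
    <income currency="CHF" period="month">10.5</income>
  </household>
</households>"""

rows = list(iter_rows(io.BytesIO(sample)))
for row in rows:
    print(row)
```

For the real file, the `io.BytesIO(sample)` argument could presumably be replaced by `gzip.open('V0_1pm/output_households.xml.gz', 'rb')`, and `rows` passed to `pd.DataFrame` with the three column names shown above.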
