当前位置: 首页 > 知识库问答 >
问题:

正则表达式用于解析在"列"中包含JSON的CSV

太叔志尚
2023-03-14

我需要解析一个逗号分隔的文本文件,其中一行可以包含一个json作为列值。该文件还包含一个标题行。

我试图构建一个正则表达式,这样我就可以解析文件并为文件的每一行和hashmap列表创建hashmap。

我意识到regex可能会过于复杂,但我不知道还能如何设计代码来获得每行的hashmap。

我尝试了不同的方法——从SO和其他网站上的例子,但无法获得良好的工作表达。它们处理不同的部分,但不处理整个数据

    String[] values = readLine.split(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))", -1);
    String[] values = readLine.split(",(?=(?:(?:[^'\",]*(?:'|\")){2})*[^'\",]*)", -1);
    String[] values = readLine.split(",(\"([^\"]|\"\")*\")", -1);
    String[] values = readLine.split(",(\\w+\\s)?(\"[^\"]+\"|\\w+)(\\(\\w\\d(,\\w\\d)*\\))?", -1);
    String[] values = readLine.split(",(?=(?:(?:[^'",]*(?:'|")){2})*[^'",]*$)", -1);

下面是数据样本

id,apptoolID,apptoolUUID,accountNumber,accountName,name,description,mac,status,pGroups,oemCode,oemTagList,locationID,userCode,businessUnit,customerDescription,notification,ptdd,hdptdd,ptddSchedule,compLive,optInfo,config1,scriptDebugging,clearLocalStorage,created_at,updated_at

5703,2535,9e849f81-56c4-4415-b36e-b0fd370b3986,21126,"3 Way ChProduct",ZYX-21126TC-P1,"3 Way ChProduct - CHSH2112601DSHCL - Showroom",24-1c-04-0a-84-9c,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHSH2112601DSHCL,"{"oemCode":"222923"}",Showroom,Showroom,false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5704,4248,494af61a-54ca-4ff2-a82d-7e795fd671ac,21126,"3 Way ChProduct",ZYX-21126TC-P2,"3 Way ChProduct - CISW2112602DTHCL - Service Waiting Area",e0-d5-5e-63-51-b9,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":27,"name":"FLEX Shop TV Players","description":"FLEX Shop TV Players group","numberOfPlayers":169},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CI,"ChProduct,CADILLAC,ALL,SHOP-SPECIFIC",CISW2112602DTHCL,"{"oemCode":"222923"}",Service_Lounge,"Service Waiting Area",true,true,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":"PSA337,PSA335,PSA332,PSA338,PSA326,PSA323,PS212,PS213,PSA293,PSA289","OPT_IN_SP":"SPCH1807","OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":"044077,034160,029205,022398,019888,019881,019880,019860,017596,010857","OPT_OUT_FA":"FACA1900,FACA1903,FACA1904,FACA1902,FACA1901,FACA1704,FACA1705,FACA1702,FACA1703,FACA1700,FACA1701,FACA1625,FACA1624,FACA1623,FACA1602,FACA1601,FACA1603,FACA0025,FACA0024,FACA0023,FACA1621,FACA00020,FACA00019,FACA1622,FACH1701","OPT_OUT_GT":false,"OPT_IN_TT":"TT043,TT037,TT033,TT032,TT031,TT027,TT028,TT026,TT029,TT030"}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"
5705,2537,94c4e9dc-e94e-4942-862a-1e4c98276f09,21126,"3 Way ChProduct",ZYX-21126TC-P3,"3 Way ChProduct - CHMA2112603MBHCL - Service Advisor Area & Service Waiting Area",24-1c-04-0a-84-b7,ACTIVE,"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]",CH,"ChProduct,ALL,SHOP-SPECIFIC",CHMA2112603MBHCL,"{"oemCode":"222923"}",Service,"Service Advisor Area & Service Waiting Area",false,false,false,,false,"{"OPT_IN_FT":false,"OPT_IN_PSA":false,"OPT_IN_SP":false,"OPT_IN_TS":false,"OPT_IN_VS":false,"OPT_IN_WA":false,"OPT_OUT_ZYX":false,"OPT_OUT_SHOPCONTENT":false,"OPT_OUT_FA":false,"OPT_OUT_GT":false,"OPT_IN_TT":false}",2X2,false,false,"2019-02-18 20:57:53","2019-02-20 19:16:08"

基本上,很难将这种类型的数据(在[]中)作为单个值获取。该文件还有其他类型的数据,如嵌套引号、引号中的逗号等。

"[{"id":46,"name":"AX_AGRP_SIRIUSXM_BAC_AXP_GROUP","description":"Shops that can play the AX created\/provided Sirius XM creative.","numberOfPlayers":4494},{"id":33,"name":"AX_ZYX_CUSTOMER","numberOfPlayers":4532},{"id":11,"name":"PST Update","numberOfPlayers":604},{"id":2,"name":"Customer Players","numberOfPlayers":5711},{"id":41,"name":"AX_AGRP_EXTRA_PRICING_PLAN_BAC_AXP_GROUP","description":"AX auto-group Extra Pricing Plan  BAC Group","numberOfPlayers":2215},{"id":39,"name":"AX_AGRP_MY_REWARDS_NATIONAL_GROUP","description":"AX auto-group My Rewards (National) BAC List","numberOfPlayers":2130}]"

感谢任何关于如何继续的帮助/建议。

谢啦

共有3个答案

琴宾鸿
2023-03-14

虽然已经两年了,但我希望这个答案能帮助那些从谷歌来到这里的人。

我的解决方案是一个接一个地读取字符,并使用括号匹配的思想,这是从“堆栈”的介绍中产生的。

当涉及到{或[时,推入堆栈。

当涉及到“}”或“]”时,从堆栈中弹出。

当涉及到“,”时,检查堆栈。如果为空,那么您已经读取了一个列字段,或者继续读取(这个逗号在json中)。

下面是java的示例代码,我只是使用了一个bracketCount变量,而不是真正的堆栈:

        File file = new File("D:/test.csv");
        FileReader fileReader = new FileReader(file);
        BufferedReader reader = new BufferedReader(fileReader);
        List<List<String>> results = reader.lines().map(str -> {
            List<String> lineResult = new ArrayList<>(256);
            char[] chars = str.toCharArray();
            int bracketCount = 0;
            StringBuilder fieldBuilder = new StringBuilder();
            for (int index = 0; index < chars.length; index++) {
                fieldBuilder.append(chars[index]);
                if (chars[index] == '{' || chars[index] == '[') {
                    bracketCount++;
                } else if (chars[index] == '}' || chars[index] == ']') {
                    bracketCount--;
                } else if (chars[index] == ',') {
                    if (bracketCount == 0) {
                        lineResult.add(fieldBuilder.substring(0, fieldBuilder.length() - 1));
                        fieldBuilder.setLength(0);
                    }
                } else if (index == chars.length - 1) {
                    lineResult.add(fieldBuilder.substring(0, fieldBuilder.length()));
                    fieldBuilder.setLength(0);
                }
            }
            return lineResult;
        }).collect(Collectors.toList());

PS:不考虑性能。

谷梁浩思
2023-03-14

如果分隔符为“|”,则此CSV文件可能由

public List<Map<String, String>> parse(Path csvPath) throws IOException {
        Reader in = new FileReader(csvPath.toFile());

        Iterable<CSVRecord> records = CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreEmptyLines(true)
                .withDelimiter('|')
                .withTrim()
                .withQuote(null)
                .parse(in);


        return StreamSupport
                        .stream(records.spliterator(), false)
                        .map(csvRecord -> csvRecord.toMap().entrySet().stream()
                                .collect(Collectors.toMap(
                                        e -> e.getKey(),
                                        e -> StringUtils.unwrap(e.getValue(), "\"")
                                )))
                        .collect(Collectors.toList());
} 

你需要libs:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.9</version>
</dependency>

请看一个项目https://github.com/sbzDev/stackoverflow/tree/master/question56028130使用单元测试

赵飞语
2023-03-14

如果你的输入数据是干净的,那么这个问题很容易解决。但是,您的JSON字符串没有正确转义,例如,“{”OPT_IN_FT“:false,“OPT_IN_PSA:”PSA337…使用双引号来表示CSV上下文和JSON上下文中的字符串。

我的建议是在JSON上下文中为项目使用单引号,并在JSON字符串项目中使用转义引号。

然后,您可以使用正则表达式或自定义编写的解析器代码轻松地调整这样或那样的解决方案。

如果您想将CSV上下文中的字符串分隔符更改为单个引号,您可以尝试以下方式:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Test
{
    public static void main (String[] args) throws java.lang.Exception
    {
      String input = "...";
      Pattern re1= Pattern.compile("(?<=^|,)(?:\"(\\[.*?\\])\"|\"(\\{.*?\\})\"|\"([^\"\\{\\}\\[\\]]*)\")(?=,|$)", Pattern.DOTALL);
      Matcher matcher = re1.matcher(input);
      String singleQuoted = matcher.replaceAll("'$1$2$3'");

      Pattern re = Pattern.compile(",(?=(?:[^']*'[^']*')*[^']*$)");
      String[] parts = re.split(singleQuoted);
        for(int partsIdx = 0; partsIdx < parts.length; partsIdx++ ){
          System.out.println( "[" + partsIdx + "] = " + parts[partsIdx]);
        }
     }
}

显然有更优雅的方法来做到这一点。

 类似资料:
  • 问题内容: 我想看看是否有可能查找个人一出字符串中的Javascript并返回它用。有点像构建搜索工具。 想象以下JSON 我希望能够搜索字符串并仅提取单个属性。 让我们假设它已经是一个,看起来就像。 这将返回的子串的。 例如 如果您想知道为什么为什么要代替使用它,我正在建立一个JSON文档存储。仅支持键/值对,因此我将整个字符串存储在unique中。我希望能够在文档上运行查询,理想情况下无需整个

  • 我需要一些“包含”和“不包含”的正则表达式。通常我会写: 包含:(.*WORD_A.*)$,不包含:(^((?!WORD_A.*)*)$ 如果单独使用,这很好,但我想写一些可以检测到的东西,比如“包含单词A和单词B”(顺序不相关!)和“包含单词A,但不包含单词B)。 基本上,我希望用户可以做出这样的声明“以单词a开头,包含单词B,但不是C和/或以D结尾”,程序返回真/假。最好的办法是只附加正则表达

  • 我发现正则表达式不包含101,如下所示: 0*1*0*(1 00 000)*(010)* 我无法理解作者是如何想出这个正则表达式的。所以我只想到了不包含101的字符串: 01000100 我似乎上面的字符串不会与上面的正则表达式匹配。但我不确定。所以尝试在regex101.com上翻译成等效的pcre正则表达式,但也失败了(因为可以看到我的正则表达式甚至不匹配包含单个1的字符串。 我的翻译有什么问

  • 问题内容: 我有一些复杂的正则表达式,为了便于阅读和维护,需要对其进行注释。Java规范相当简洁,为了使它正常工作,我努力了很长时间。我终于抓住了我的错误,并将其发布为答案,但对于保持正则表达式的任何其他建议,我将不胜感激 作为示例,我想在一个简单的名称解析器中注释(patternS的)子组件: 编辑:我也将感谢(?x)格式的示例。 编辑:@ geowa4有一个很好的建议,可以避免嵌入注释。自从J

  • 问题内容: 我想获取HTML中隐藏的输入字段的值。 我想用Python编写一个正则表达式,该表达式将返回的值,因为我知道HTML中的行遵循以下格式 有人可以提供Python范例来解析HTML值吗? 问题答案: 对于这种特殊情况,BeautifulSoup比正则表达式更难编写,但是它更健壮…我只是为BeautifulSoup示例提供帮助,因为您已经知道要使用哪个正则表达式:-)

  • Java中的正则表达式问题。我正在从元素href属性中提取Id号。我在一个字符串中有一堆这样的链接: