Question:

Merge multiple CSV files with the same header into separate group files

卓新知
2023-03-14

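Hi, I'm looking for the fastest way to process a large batch of CSV files.

Situation: I have multiple CSV files in a single folder, and they have different headers.

I have already preprocessed them to remove the junk lines at the top, so every file now starts with a standard header row.

I want to merge each set of CSV files that share exactly the same header into one group file in a new folder.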

Single Folder:
    Tree 
    ├── 161598827330618_data_aa.csv 
    ├── ..............  
    ├── ............... 
    ├── ................ 
    ├── 161598852706227_data_bh.csv 
Note: Filenames are random, with no pattern.

Sample File-1.csv

School Name,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode
George HS - QUEENS,New,76850000,CAP,Q298,50-51 98TH STREET,Queens,11368 
MARBLE HILL INTERNATIONAL HS -,EXT MASONRY/FLOOD/PARAPETS/ROOFS,10490000,CIP,X475,99 TERRACE VIEW AVENUE,Bronx,10463
NEW DORP HS - STATEN ISLAND,PARTIAL ACCESSIBILITY,488000,CIP,R435,465 NEW DORP LANE,Staten Island,10306

Sample File-2.csv

School Name,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode
EAST SIDE COMMUNITY SCHOOL,FIFTH FLOOR CEILING REPLACEMENT,150000,CIP,M060,420 EAST 12 STREET,Manhattan,10009
RICHMOND HILL HS - QUEENS,STEEL DETERIORATED COLUMS & COLUMN,1064400,CIP,Q475,89-30 114 STREET,Queens,11418
SUCCESS ACADEMY CHARTER SCHOOL,INTERIOR STAIRS,2045000,CIP,M099,410 EAST 100 STREET,Manhattan,10029

Sample File-3.csv

Reporting Period,Project Number,City,County,Zip Code,Sector,Solicitation,Electric Utility
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority

Sample File-4.csv

Reporting Period,Project Number,City,County,Zip Code,Sector,Solicitation,Electric Utility
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority

Sample File-5.csv

OBJECTID,Borough,PSSite,ParkName,ParkZone,PSStatus,GlobalID,CreatedDate,UpdatedDate
283721,Brooklyn,Street,,,Populated,C90AAD08-D99E-4759-A64C-219D6143BFB3,07-08-15 13:10,12/20/2019 04:34:58 PM
7669836,Queens,Park,Astoria Park,Q004-ZN02,Empty,AB55A658-8276-4734-A698-5FFCAE96578E,08/13/2020 01:18:00 PM,08/20/2020 06:15:32 PM
7123408,Brooklyn,Park,Asser Levy Park,,Populated,B32D93C9-5958-4129-A87A-FA7C9A5A4E87,01-09-20 13:15,01-09-20 13:17

Sample File-6.csv

OBJECTID,Borough,PSSite,ParkName,ParkZone,PSStatus,GlobalID,CreatedDate,UpdatedDate
6036681,Manhattan,Park,Riverside Park,,Populated,6A3E747D-CD5E-43EB-9789-67DB2064E878,04-11-18 11:11,08-06-20 21:21
7170578,Bronx,Park,Garden Of Eden,,Populated,B1E8B660-4B65-437F-B61F-06B1B71A4E1C,01/28/2020 03:18:00 PM,01/28/2020 03:19:26 PM
740416,Bronx,Park,Mullaly Park,X034-ZN02,Populated,E8F51E3B-CC6F-46A3-AF17-02B6BE8DCC57,08/26/2015 04:34:00 PM,01/30/2020 04:10:41 PM
5004669,Queens,Street,,,Populated,20157769-88EC-4867-9F50-852EF4814BF0,11-02-16 16:56,08-03-20 13:12:00 AM

Sample File-7.csv

Indicator,Group,State,Subgroup,Phase,Time Period,Time Period Label,Value,Low CI,High CI,Confidence Interval
Private Health Insurance Coverage,National Estimate,United States,United States,1,1,Apr 23 - May 5,75.4,74.7,76.2,74.7 - 76.2
Public Health Insurance Coverage,By Age,United States,18 - 24 years,1,1,Apr 23 - May 5,19.5,15.4,24.3,15.4 - 24.3
Uninsured at the Time of Interview,By Gender,United States,Female,1,1,Apr 23 - May 5,11,10.3,11.7,10.3 - 11.7

Sample File-8.csv

Year, dtmSurveyDate, ColonyID, strAOUCode, Type, strPhotoInterpreters, strColonyName, strCounty, strState, strCountry
2014,03-Jun-14,219-001,COMU,Image Check - No Birds,Kirsten Bixler,"""Tillamook Head Rocks"" (Eastern Rocks)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-002,COMU,Image Check - No Birds,Kirsten Bixler,"""Tillamook Head Rocks"" (Northern Rock)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-003,COMU,Shapefile-RawCount,Kirsten Bixler,"""Tillamook Head Rocks"" (Southwestern Rocks)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-005,COMU,Shapefile,Shawn W. Stephensen,Tillamook Rock,Clatsop County,Oregon,United States

Expected result:

Sample File-1.csv }
Sample File-2.csv } Header check

Sample File-3.csv }
Sample File-4.csv } Header check

Sample File-5.csv }
Sample File-6.csv } Header check

Sample File-7.csv } Header check
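Before merging, it can help to see how many distinct headers the folder contains, since that is the number of group files to expect. The following is only a rough sketch: `header_census` is a hypothetical helper name, and it assumes every file's first line is its header, as described above.

```shell
# header_census FILE...
# Prints each distinct first line with the number of files that share it,
# most common header first. (Hypothetical helper, not from the original post.)
header_census() {
  local f header
  for f in "$@"; do
    header=""
    # Read only the first line; tolerate a missing trailing newline.
    IFS= read -r header < "$f" || [ -n "$header" ] || continue
    printf '%s\n' "$header"
  done | sort | uniq -c | sort -rn
}
```

For example, `header_census ./*.csv` prints one line per distinct header. Only the first line of each file is read, so this stays cheap even for large files.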

Preferred solution: a Bash script and Linux commands.

Solution I tried:

#!/bin/bash
awk '
  FNR==1{
    if (!($0 in h)||file!=h[$0]){close(file)}
    if (!($0 in h)){file=h[$0]=i++}
    else{file=h[$0];next}
  }
  {print >> (file)}
' ./*.csv

https://unix.stackexchange.com/a/602291/459978

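The above works, but I'm not sure whether it can process and classify 1000 files. I also need the group .csv files to be created in a different folder.

The shortest completion time is what matters: https://stackoverflow.com/a/51921621/3088275

I'm looking for Linux commands, or a bash script using awk or sed, whichever produces the required output fastest.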
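The requirements above (one merged file per header, written to a separate folder, in a single pass) could be sketched roughly as follows. This is an illustration, not a benchmarked solution: the function name `merge_by_header` and the `groupNN.csv` naming are assumptions, and the output directory path is assumed to contain no backslashes, because `awk -v` interprets escape sequences.

```shell
# merge_by_header OUTDIR FILE...
# Hypothetical helper: appends the data rows of every FILE to
# OUTDIR/groupNN.csv, one group per distinct header line.
merge_by_header() {
  local outdir=$1
  shift
  mkdir -p "$outdir" || return 1
  awk -v outdir="$outdir" '
    FNR == 1 {                      # first line of each input file = its header
      if (out != "") close(out)     # keep at most one output file open at a time
      if (!($0 in group)) {         # header not seen before: start a new group
        out = group[$0] = sprintf("%s/group%02d.csv", outdir, ++n)
        print > out                 # create or truncate, write the header once
        close(out)
      }
      out = group[$0]
      next                          # never copy the header as a data row
    }
    { print >> out }                # data rows are appended to the group file
  ' "$@"
}
```

A call such as `merge_by_header ./merged ./*.csv` reads each input once and numbers the groups in the order their headers first appear. Closing the output whenever the input file changes keeps the number of open file descriptors constant regardless of how many groups exist. With a very large number of files, the operating system's argument-length limit may apply to the `awk` invocation.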

1 answer

劳烨
2023-03-14

You can use the following AWK script. I tested it with all of the sample files.

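# script.awk: expects the output of `head -n 1 file1 file2 ...` on stdin
!NF { next }                                   # skip the blank separator lines
# "==> name <==" line: strip the 4-character prefix and suffix
NR % 3 == 1 { filename = substr($0, 5, length($0) - 8) }
# header line: add this file to the comma-separated list for its header
NR % 3 == 2 { headers[$0] = headers[$0] (headers[$0] == "" ? "" : ",") filename }

END {
  i = 1
  for (header in headers) {                    # traversal order is unspecified
    printf("Group %02d: %s\n", i, headers[header])
    out = sprintf("group%02d.txt", i)
    printf "" > out                            # create or truncate the group file
    close(out)
    split(headers[header], a, ",")             # breaks if a filename contains a comma
    for (idx in a) {
      getline x < a[idx]                       # read and discard the header line
      while ((getline x < a[idx]) > 0)
        print x >> out
      close(a[idx])                            # release the input file descriptor
    }
    close(out)
    i++
  }
}
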
$ ls Sample\ File-*
Sample File-1.csv Sample File-3.csv Sample File-5.csv Sample File-7.csv
Sample File-2.csv Sample File-4.csv Sample File-6.csv Sample File-8.csv
$ head -n 1 Sample\ File-* | awk -f script.awk
Group 01: Sample File-1.csv,Sample File-2.csv
Group 02: Sample File-8.csv
Group 03: Sample File-7.csv
Group 04: Sample File-5.csv,Sample File-6.csv
Group 05: Sample File-3.csv,Sample File-4.csv
$ cat group01.txt
George HS - QUEENS,New,76850000,CAP,Q298,50-51 98TH STREET,Queens,11368
MARBLE HILL INTERNATIONAL HS -,EXT MASONRY/FLOOD/PARAPETS/ROOFS,10490000,CIP,X475,99 TERRACE VIEW AVENUE,Bronx,10463
NEW DORP HS - STATEN ISLAND,PARTIAL ACCESSIBILITY,488000,CIP,R435,465 NEW DORP LANE,Staten Island,10306
EAST SIDE COMMUNITY SCHOOL,FIFTH FLOOR CEILING REPLACEMENT,150000,CIP,M060,420 EAST 12 STREET,Manhattan,10009
RICHMOND HILL HS - QUEENS,STEEL DETERIORATED COLUMS & COLUMN,1064400,CIP,Q475,89-30 114 STREET,Queens,11418
SUCCESS ACADEMY CHARTER SCHOOL,INTERIOR STAIRS,2045000,CIP,M099,410 EAST 100 STREET,Manhattan,10029
$ cat group02.txt
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority
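
One caveat about the group numbers: awk does not specify the order in which `for (key in array)` visits keys, so the number assigned to a given header can differ between awk implementations and runs. If stable numbering matters, a variant along these lines records the order in which headers first appear. It is a minimal sketch with a hypothetical function name, `list_groups`, and it reads the files directly instead of parsing `head` output.

```shell
# list_groups FILE...
# Hypothetical helper: prints "Group NN: file1,file2,..." with groups
# numbered in the order their header is first encountered.
list_groups() {
  awk '
    FNR == 1 {
      if (!($0 in members)) order[++n] = $0    # remember first-seen order
      members[$0] = (members[$0] == "" ? "" : members[$0] ",") FILENAME
    }
    END {
      for (i = 1; i <= n; i++)
        printf "Group %02d: %s\n", i, members[order[i]]
    }
  ' "$@"
}
```

This version scans every file to the end, which is wasteful when only the first line is needed. Where the awk in use supports `nextfile`, adding it at the end of the `FNR == 1` block skips the remaining lines of each file.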