Question:

Fast/efficient way to extract data from multiple large NetCDF files

寿鸣

2023-03-14

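I need to extract data from a global grid for a specific set of nodes only, given by lat/lon coordinates (on the order of 5000-10000 nodes). The data are time series of hydraulic parameters, for example wave height.

The global dataset is huge, so it is split into many NetCDF files. Each NetCDF file is around 5 GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location: I need to extract data from 6 x 42 = 252 NC files, each 5 GB in size.

My current approach is a triple loop over years, variables, and nodes. I use Xarray to open each NC file, extract the data for all the required nodes, and store it in a dictionary. Once all the data is in the dictionary, I create one pd.DataFrame per location and store it as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB per location (so not very large, actually).

My approach works perfectly fine for a small number of locations, but as soon as it grows to a few hundred, it takes extremely long. My gut feeling is that this is a memory problem (since all the extracted data is first stored in a single dictionary until every year and variable has been extracted). But one of my colleagues said that Xarray is actually quite inefficient and that this might be the cause of the long run time.

Does anyone here have experience with similar issues, or know of an efficient way to extract data from a large number of NC files? I put the code I currently use below. Thanks for any help!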

import numpy as np
import xarray as xr

# set conditions
variables = {...}               # dictionary which contains the variables
years = np.arange(y0, y1 + 1)   # year range
ndata = {}                      # dictionary which will contain all data

# loop through all the desired variables
for v in variables.keys():
    ndata[v] = {}

    # For each variable, loop through each year, open the nc file and extract the data
    for y in years:
        
        # Open file with xarray
        fname = 'xxx.nc'
        data = xr.open_dataset(fname)
        
        # loop through the locations and load the data for each node as temp
        for n in range(len(nodes)):
            node = nodes.node_id.iloc[n]
            lon = nodes.lon.iloc[n]
            lat = nodes.lat.iloc[n]    
            
            temp = data.sel(longitude=lon, latitude=lat)
            
            # For the first year, store the data into the ndata dict
            if y == years[0]:
                ndata[v][node] = temp
            # For subsequent years, concatenate the existing array in ndata
            else:
                ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')

# merge the variables for the current location into one dataset
for n in range(len(nodes)):
    node = nodes.node_id.iloc[n]
    
    dset = xr.merge([ndata[v][node] for v in variables.keys()])
    df = dset.to_dataframe()

    # save dataframe as pickle file, named by the node id
    df.to_pickle('%s.xz' % node)

关玮

2023-03-14

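This is a pretty common workflow, so I'll give a few pointers. Here are some suggested changes, most important first.

Use xarray's advanced indexing to select all points at once

It looks like you're using a pandas DataFrame named nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in Python, remove the inner for loop whenever possible and rely on array-based operations written in C.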

# create an xr.Dataset indexed by node_id with arrays `lat` and `lon`
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# select all points from each file simultaneously, reshaping to be
# indexed by `node_id`; the dimension names (latitude/longitude) follow
# the code in the question
node_data = data.sel(latitude=node_indexer.lat, longitude=node_indexer.lon)

# dump this reshaped data to pandas, with each variable becoming a column
node_df = node_data.to_dataframe()
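
To see what the pointwise selection does, here is a minimal self-contained sketch on a small synthetic grid instead of a real file. The variable name hs, the grid values, and the node coordinates are made up for illustration; the dimension names latitude/longitude are assumed from the question's code. Because both indexers share the node_id dimension, xarray picks one grid point per node rather than the full lat x lon cross product.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for one yearly file: a tiny (time, latitude, longitude) grid.
time = pd.date_range("2020-01-01", periods=4, freq="D")
latitude = np.arange(-2.0, 3.0)     # -2, -1, 0, 1, 2
longitude = np.arange(10.0, 14.0)   # 10, 11, 12, 13
values = np.arange(
    time.size * latitude.size * longitude.size, dtype=float
).reshape(time.size, latitude.size, longitude.size)
data = xr.Dataset(
    {"hs": (("time", "latitude", "longitude"), values)},  # "hs" is a made-up name
    coords={"time": time, "latitude": latitude, "longitude": longitude},
)

# Hypothetical node table with the same columns as in the question.
nodes = pd.DataFrame(
    {"node_id": [101, 102, 103], "lat": [-2.0, 0.0, 1.0], "lon": [10.0, 12.0, 13.0]}
)

# Indexer Dataset with arrays `lat` and `lon`, both along the node_id dimension.
node_indexer = nodes.set_index("node_id")[["lat", "lon"]].to_xarray()

# One vectorized selection: the result is indexed by time and node_id.
node_data = data.sel(latitude=node_indexer.lat, longitude=node_indexer.lon)

# The same selection written as an explicit loop, for comparison only.
looped = np.stack(
    [
        data["hs"].sel(latitude=row.lat, longitude=row.lon).values
        for row in nodes.itertuples()
    ],
    axis=1,
)
```

Both approaches return the same numbers; the vectorized call simply issues one indexing operation per file instead of one per node. If the node coordinates do not fall exactly on grid points, passing method='nearest' to sel is an option.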

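Only reshape arrays once

In your code, you loop over many years, and for every year after the first you allocate a new array with enough memory to hold all the years stored so far.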

# For the first year, store the data into the ndata dict
if y == years[0]:
    ndata[v][node] = temp
# For subsequent years, concatenate the existing array in ndata
else:
    ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')

Instead, just gather the data for all years and concatenate them at the end. This allocates the array needed for all the data only once.
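
A rough sketch of that pattern, with a hypothetical fake_year helper standing in for "open one yearly file and select the nodes" (the helper, the variable name hs, and the node ids are invented for illustration):

```python
import numpy as np
import pandas as pd
import xarray as xr


def fake_year(year):
    """Hypothetical stand-in for the per-node selection from one yearly file."""
    time = pd.date_range(f"{year}-01-01", periods=3, freq="D")
    return xr.Dataset(
        {"hs": (("time", "node_id"), np.full((3, 2), float(year)))},
        coords={"time": time, "node_id": [101, 102]},
    )


years = range(2018, 2021)

# Collect every year's selection in a plain Python list first ...
pieces = [fake_year(y) for y in years]

# ... then concatenate once, instead of growing the array inside the loop.
combined = xr.concat(pieces, dim="time")
```

In real use, the body of fake_year would be replaced by the open-and-select step for the file of that year.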

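Use dask, e.g. via xr.open_mfdataset, to leverage multiple cores. If you do this, you may want to consider an output format that supports multithreaded writes, e.g. zarr.

Put together, this could look something like this: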

# build nested filepaths
filepaths = [
    ['xxx.nc'.format(year=y, variable=v) for y in years]
    for v in variables
]

# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()

# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
    filepaths,
    combine='nested',
    concat_dim=['variable', 'year'],
    parallel=True,
)

# this will only schedule the operation - no work is done until the next line
ds_nodes = ds.sel(latitude=node_indexer.lat, longitude=node_indexer.lon)

# this triggers the operation using dask's default threaded scheduler,
# leveraging multiple threads on your machine (or a distributed Client if
# you have one set up)
ds_nodes.to_zarr('all_the_data.zarr')

# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()
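
If you still want one pickle file per node, as in the question, one possible way is to group the resulting DataFrame by node_id. The sketch below uses a small synthetic ds_nodes; the variable names hs and tp, the node ids, and the temporary output directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the node-indexed result (dims: time, node_id).
time = pd.date_range("2020-01-01", periods=3, freq="D")
ds_nodes = xr.Dataset(
    {
        "hs": (("time", "node_id"), np.arange(6.0).reshape(3, 2)),
        "tp": (("time", "node_id"), np.arange(6.0).reshape(3, 2) + 10),
    },
    coords={"time": time, "node_id": [101, 102]},
)

# One row per (time, node_id) pair, one column per variable.
df = ds_nodes.to_dataframe()

# Write one compressed pickle per node, named by the node id.
out_dir = Path(tempfile.mkdtemp())  # placeholder output location
written = {}
for node, node_df in df.groupby(level="node_id"):
    node_df = node_df.droplevel("node_id")  # leave only the time index
    path = out_dir / f"{node}.xz"
    node_df.to_pickle(path)                 # compression inferred from ".xz"
    written[node] = path
```

Each file then holds a time-indexed DataFrame with one column per variable, which matches the per-location layout described in the question.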
