Question:

Most efficient way to open multiple netCDF files with xarray/dask using open_mfdataset (abnormally slow in my case)

许照

2023-03-14

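I have a directory of 363 netCDF files, each corresponding to a different time (all files share a similar internal structure, with a "time" dimension of length 1). Each file is about 270 MB, roughly 100 GB in total. I want to load all of this data into a single xarray Dataset (backed by dask arrays and chunks). open_mfdataset seems to be the appropriate tool, but I do not appear to be using it correctly, because it is very slow.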

# Import modules
import time
import numpy as np
import xarray as xr
from dask.distributed import Client

# Start a local dask.distributed cluster
client = Client()

# Define variables of interest (note: not used when opening the files below)
variables = ['nitrogendioxide_tropospheric_column_qafiltered']

# Read data
start = time.time()
data_dir = '/data_directory'
ds = xr.open_mfdataset('{}/*2019*.nc'.format(data_dir), engine='netcdf4', combine='nested', concat_dim='time', parallel=True)
ds.close()
print(' | size(ds)/duration = {:0.2f}GB / {:0.2f}s'.format(ds.nbytes / 1e9, time.time() - start))

The time this takes is: size(ds)/duration = 98.83GB / 1746.73s. Why is it so slow?

Note that if I leave out client = Client() and parallel=True, the time does not change significantly, which confuses me.
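To put the reported timing in perspective, here is a small back-of-the-envelope calculation using only the numbers quoted in the question. It assumes, as is normally the case, that open_mfdataset returns lazy dask-backed arrays, so the measured time mostly reflects opening and combining the files rather than reading all 98.83 GB from disk.

```python
# Numbers quoted in the question.
n_files = 363
total_gb = 98.83       # ds.nbytes / 1e9
duration_s = 1746.73   # wall-clock time of the timed block

# Average cost of opening and combining one file.
seconds_per_file = duration_s / n_files

# Average uncompressed size that each file contributes to the Dataset.
mb_per_file = total_gb * 1e3 / n_files

print(f"{seconds_per_file:.2f} s per file, {mb_per_file:.0f} MB per file")
```

Under that assumption, close to 5 seconds per file is overhead, which suggests looking at what happens per file at open time (metadata decoding, coordinate comparison) rather than at raw read bandwidth.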

Note: this test was run in an interactive session on an HPC facility:

>>> client  
<Client: 'tcp://127.0.0.1:43651' processes=4 threads=4, memory=33.78 GB>

NB bis: the resulting xarray Dataset is:

>>> ds
<xarray.Dataset>
Dimensions:                                                    (corner: 4, time: 363, x: 1028, x_b: 1029, y: 649, y_b: 650)
Coordinates:
    lat                                                        (y, x) float64 dask.array<chunksize=(649, 1028), meta=np.ndarray>
    lon                                                        (y, x) float64 dask.array<chunksize=(649, 1028), meta=np.ndarray>
  * time                                                       (time) datetime64[ns] 2019-01-01T05:00:00 ... 2019-12-31T05:00:00
Dimensions without coordinates: corner, x, x_b, y, y_b
Data variables:
    nitrogendioxide_tropospheric_column                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_tropospheric_column_precision              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    qa_value                                                   (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    latitude_bounds                                            (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    longitude_bounds                                           (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    solar_zenith_angle                                         (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    solar_azimuth_angle                                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_zenith_angle                                       (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_azimuth_angle                                      (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_fraction_crb                                         (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_altitude                                           (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_albedo                                             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_classification                                     (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_pressure                                           (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_radiance_fraction_nitrogendioxide_window             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column                       (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column_precision             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    degrees_of_freedom                                         (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    one                                                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    latitude_bounds_qafiltered                                 (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    longitude_bounds_qafiltered                                (time, corner, y, x) float64 dask.array<chunksize=(1, 4, 649, 1028), meta=np.ndarray>
    nitrogendioxide_tropospheric_column_qafiltered             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_tropospheric_column_precision_qafiltered   (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    qa_value_qafiltered                                        (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_radiance_fraction_nitrogendioxide_window_qafiltered  (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    cloud_fraction_crb_qafiltered                              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_altitude_qafiltered                                (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_albedo_qafiltered                                  (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_classification_qafiltered                          (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    surface_pressure_qafiltered                                (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    solar_zenith_angle_qafiltered                              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    solar_azimuth_angle_qafiltered                             (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_zenith_angle_qafiltered                            (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    viewing_azimuth_angle_qafiltered                           (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column_qafiltered            (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    nitrogendioxide_stratospheric_column_precision_qafiltered  (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    degrees_of_freedom_qafiltered                              (time, y, x) float64 dask.array<chunksize=(1, 649, 1028), meta=np.ndarray>
    lat_b                                                      (time, y_b, x_b) float64 dask.array<chunksize=(1, 650, 1029), meta=np.ndarray>
    lon_b                                                      (time, y_b, x_b) float64 dask.array<chunksize=(1, 650, 1029), meta=np.ndarray>
Attributes:
    regrid_method:  conservative
    history:        read PRODUCT group...
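
The repr above also shows how small the variable of interest is relative to the whole Dataset. The following sketch works that out from the printed dimensions and dtype (float64, 8 bytes per value); the total is the ds.nbytes figure reported in the question.

```python
# Dimensions and dtype taken from the Dataset repr.
n_time, ny, nx = 363, 649, 1028
bytes_per_value = 8  # float64

# One dask chunk of a (time, y, x) variable has shape (1, 649, 1028).
chunk_bytes = ny * nx * bytes_per_value

# One full (time, y, x) variable, e.g.
# nitrogendioxide_tropospheric_column_qafiltered.
var_bytes = n_time * chunk_bytes

# Total size reported in the question.
total_bytes = 98.83e9

print(f"chunk: {chunk_bytes / 1e6:.2f} MB")
print(f"one variable: {var_bytes / 1e9:.2f} GB")
print(f"share of dataset: {100 * var_bytes / total_bytes:.1f} %")
```

A single (time, y, x) variable is therefore only a small fraction of the Dataset, so opening just that variable should involve far fewer arrays to decode and combine.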

I have looked at other posts but did not find an answer to this problem. Thank you for your help.

1 answer

索令

2023-03-14

Take a look at this GitHub issue: it seems that many people have performance problems with open_mfdataset, and there is currently no obvious solution.
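
Although there may be no general fix, a few documented open_mfdataset arguments are often suggested for this kind of workload, and the sketch below shows them together. It is not a tested solution for the dataset in the question. The idea is to drop unneeded variables in a preprocess callback and to skip cross-file comparisons with data_vars='minimal', coords='minimal' and compat='override' (the default coords='different' may load coordinate variables such as lat and lon from every file in order to compare them). The helper names keep_var and open_subset and the tiny synthetic files are assumptions made for illustration; engine='netcdf4' and parallel=True from the question are omitted here only to keep the demo portable.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

VAR = "nitrogendioxide_tropospheric_column_qafiltered"


def keep_var(ds):
    """Hypothetical helper: keep only the variable of interest (and its coords)."""
    return ds[[VAR]]


def open_subset(paths):
    """Hypothetical helper: open files lazily with minimal cross-file checks."""
    return xr.open_mfdataset(
        paths,
        combine="nested",      # trust the order of `paths`
        concat_dim="time",
        preprocess=keep_var,   # applied to each file before combining
        data_vars="minimal",   # only concatenate variables that have `time`
        coords="minimal",      # do not compare lat/lon across files
        compat="override",     # take non-concatenated variables from the first file
    )


# Demo on three tiny synthetic files standing in for the real ones.
with tempfile.TemporaryDirectory() as tmp:
    times = pd.date_range("2019-01-01T05:00", periods=3, freq="D")
    for i in range(3):
        tiny = xr.Dataset(
            {
                VAR: (("time", "y", "x"), np.full((1, 2, 3), float(i))),
                "qa_value": (("time", "y", "x"), np.ones((1, 2, 3))),
            },
            coords={
                "time": times[i:i + 1],
                "lat": (("y", "x"), np.zeros((2, 3))),
                "lon": (("y", "x"), np.zeros((2, 3))),
            },
        )
        tiny.to_netcdf(str(Path(tmp) / f"demo_2019_{i}.nc"))

    # combine='nested' relies on file order, so sort the paths explicitly.
    paths = sorted(str(p) for p in Path(tmp).glob("*2019*.nc"))
    with open_subset(paths) as ds:
        result_vars = list(ds.data_vars)
        result_shape = ds[VAR].shape
        print(ds)
```

With compat='override', xarray does not check that lat and lon are identical across files, so this is only appropriate when all files are known to share the same grid, as the question states.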

