Python xarray DataArray usage

於子晋

2023-12-01

xarray.DataArray is a labeled, multi-dimensional array. Its key properties are:

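values: a numpy.ndarray holding the array's values
dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
attrs: a dict holding arbitrary metadata (attributes)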

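xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the index functionality of a pandas DataFrame or Series.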
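The difference between positional axes and named dimensions can be sketched as follows. This is a minimal illustration, not part of the original walkthrough: the array contents and the names values, temps, by_axis, by_name and il_second_day are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 4x3 array: 4 days by 3 locations.
values = np.arange(12.0).reshape(4, 3)
temps = xr.DataArray(
    values,
    coords=[pd.date_range("2000-01-01", periods=4), ["IA", "IL", "IN"]],
    dims=["time", "space"],
)

# NumPy reduces by axis position; xarray reduces by dimension name.
by_axis = values.mean(axis=0)
by_name = temps.mean(dim="time")

# Label-based selection uses coordinate values instead of integer positions.
il_second_day = temps.sel(time="2000-01-02", space="IL")
```

Both reductions compute the same numbers; the xarray version simply does not depend on remembering which axis is which.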

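A DataArray object can also have a name, and can hold arbitrary metadata in its attrs property. Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases (see the FAQ entry "What is your approach to metadata?").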

Creating a DataArray

The DataArray constructor takes:

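data: a multi-dimensional array of values (e.g., a numpy ndarray, Series, DataFrame or pandas.Panel; note that pandas.Panel exists only in older pandas releases)
coords: a list or dictionary of coordinates. If a list, it should be a list of tuples where the first element is the dimension name and the second element is the corresponding coordinate array-like object.
dims: a list of dimension names. If omitted and coords is a list of tuples, dimension names are taken from coords.
attrs: a dictionary of attributes to add to the instance
name: a string that names the instance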

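The session below assumes the usual imports: import numpy as np, import pandas as pd, import xarray as xr.

In [1]: data = np.random.rand(4, 3)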

In [2]: locs = ['IA', 'IL', 'IN']

In [3]: times = pd.date_range('2000-01-01', periods=4)

In [4]: foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])

In [5]: foo
Out[5]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

Only data is required; all other arguments are filled in with default values:

In [6]: xr.DataArray(data)
Out[6]: 
<xarray.DataArray (dim_0: 4, dim_1: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Dimensions without coordinates: dim_0, dim_1

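As you can see, dimension names are always present in the xarray data model: if you do not provide them, defaults of the form dim_N are created. Coordinates, however, are always optional, and dimensions do not get automatic coordinate labels.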

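Note:
This differs from pandas, where axes always have tick labels, which default to the integers [0, ..., n-1].
Prior to xarray v0.9, xarray copied this behavior: default coordinates were created for every dimension if coordinates were not supplied explicitly. This is no longer the case.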
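A short check of the defaults described above. The variable names bare and first_row are illustrative assumptions.

```python
import numpy as np
import xarray as xr

# Only data is supplied: no dims, no coords.
bare = xr.DataArray(np.zeros((4, 3)))

print(bare.dims)         # default names follow the dim_N pattern
print(len(bare.coords))  # no coordinate labels are created automatically

# Positional indexing by dimension name still works without coordinates.
first_row = bare.isel(dim_0=0)
```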

Coordinates can be specified in the following ways:

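A list of values with length equal to the number of dimensions, providing coordinate labels for each dimension. Each value must be of one of the following forms:
- a DataArray or Variable
- a tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
- a pandas object or scalar value, which is converted into a DataArray
- a 1D array or list, which is interpreted as values for a one-dimensional coordinate variable along the dimension of the same name
A dictionary of the form {coord_name: coord}, where the values take the same forms as in the list. Supplying coordinates as a dictionary allows coordinates other than those corresponding to dimensions (more on these later). If you supply coords as a dictionary, you must explicitly provide dims.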
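The (dims, data[, attrs]) tuple form and the scalar form listed above can be sketched together. This is a minimal example under assumed inputs: the attribute key long_name, its value, and the variable name arr are hypothetical.

```python
import numpy as np
import xarray as xr

data = np.zeros((2, 3))
arr = xr.DataArray(
    data,
    coords={
        # (dims, data, attrs) tuple: the attrs end up on the coordinate variable.
        "space": ("space", ["IA", "IL", "IN"], {"long_name": "US state code"}),
        # Scalar value: a coordinate that is not attached to any dimension.
        "const": 42,
    },
    dims=["time", "space"],
)

print(arr.coords["space"].attrs)
print(arr.coords["const"].dims)
```

The time dimension gets no coordinate here because none was supplied, which is consistent with coordinates being optional.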

As a list of tuples:

In [7]: xr.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

As a dictionary:

In [8]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': ('space', [1, 2, 3])},
   ...:              dims=['time', 'space'])
   ...: 
Out[8]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    const    int64 42
    ranking  (space) int64 1 2 3

As a dictionary with a coordinate that spans multiple dimensions:

In [9]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': (('time', 'space'), np.arange(12).reshape(4,3))},
   ...:              dims=['time', 'space'])
   ...: 
Out[9]: 
<xarray.DataArray (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    const    int64 42
    ranking  (time, space) int64 0 1 2 3 4 5 6 7 8 9 10 11

If you create a DataArray by supplying a pandas Series, DataFrame or pandas.Panel, any unspecified arguments to the DataArray constructor are filled in from the pandas object:

In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])

In [11]: df.index.name = 'abc'

In [12]: df.columns.name = 'xyz'

In [13]: df
Out[13]: 
xyz  x  y
abc      
a    0  2
b    1  3

In [14]: xr.DataArray(df)
Out[14]: 
<xarray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
       [1, 3]])
Coordinates:
  * abc      (abc) object 'a' 'b'
  * xyz      (xyz) object 'x' 'y'
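The same inference applies to a pandas Series. The sketch below uses a hypothetical Series with an index named letter and converts it back with to_series(); the names s, da and back are assumptions for the example.

```python
import pandas as pd
import xarray as xr

# Hypothetical Series whose index carries a name.
s = pd.Series([1, 2, 3], index=pd.Index(["a", "b", "c"], name="letter"))

# The dimension name and coordinate labels are taken from the index.
da = xr.DataArray(s)
print(da.dims)

# Label-based selection uses the index values.
print(da.sel(letter="b").item())

# Convert back to pandas.
back = da.to_series()
```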

DataArray properties

Let's take a look at the important properties of our array:

In [15]: foo.values
Out[15]: 
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])

In [16]: foo.dims
Out[16]: ('time', 'space')

In [17]: foo.coords
Out[17]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'

In [18]: foo.attrs
Out[18]: {}

In [19]: print(foo.name)
None

The array values can be modified in place:

In [20]: foo.values = 1.0 * foo.values

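Note:
The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xarray, use coordinates, or put separate DataArray objects in a single Dataset (see below).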
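A small sketch of both points: values can be reassigned or modified element by element, and mixed inputs are coerced to one data type. The arrays arr and mixed are illustrative assumptions.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((2, 3)), dims=["time", "space"])

# Replace the whole array; the new array must have the same shape.
arr.values = arr.values + 1.0

# Assign a single element by position.
arr[0, 0] = 5.0

# Mixed int/float input is stored with a single dtype.
mixed = xr.DataArray([1, 2.5])
print(mixed.dtype)
```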

Now fill in some of the missing metadata:

In [21]: foo.name = 'foo'

In [22]: foo.attrs['units'] = 'meters'

In [23]: foo
Out[23]: 
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters

The rename() method is another option, returning a new data array:

In [24]: foo.rename('bar')
Out[24]: 
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[0.127, 0.967, 0.26 ],
       [0.897, 0.377, 0.336],
       [0.451, 0.84 , 0.123],
       [0.543, 0.373, 0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
Attributes:
    units:    meters
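Because rename() returns a new object, the original array keeps its name. The sketch below also passes a dictionary to rename a dimension; the zero-filled array and the new dimension name state are assumptions for illustration.

```python
import numpy as np
import xarray as xr

foo = xr.DataArray(
    np.zeros((4, 3)),
    dims=["time", "space"],
    name="foo",
    attrs={"units": "meters"},
)

# A string argument renames the array itself; foo is left unchanged.
bar = foo.rename("bar")

# A dictionary argument renames dimensions (or coordinates) instead.
relabeled = foo.rename({"space": "state"})

print(foo.name, bar.name, relabeled.dims)
```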

DataArray Coordinates

The coords property is dict-like. Individual coordinates can be accessed from coords by name, or even by indexing the data array itself:

In [25]: foo.coords['time']
Out[25]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

In [26]: foo['time']
Out[26]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

These are also DataArray objects, which contain the tick labels for each dimension.

Coordinates can also be set or removed using dictionary-like syntax:

In [27]: foo['ranking'] = ('space', [1, 2, 3])

In [28]: foo.coords
Out[28]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
    ranking  (space) int64 1 2 3

In [29]: del foo['ranking']

In [30]: foo.coords
Out[30]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) <U2 'IA' 'IL' 'IN'
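The item assignment and del shown above modify the array in place. As a hedged alternative, the sketch below uses assign_coords() and drop_vars(), which return new objects; the array arr and the ranking values are assumptions for the example.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.zeros((2, 3)),
    coords={"space": ["IA", "IL", "IN"]},
    dims=["time", "space"],
)

# coords supports dict-style membership tests.
has_space = "space" in arr.coords
has_time = "time" in arr.coords  # False: no labels were given for this dimension

# Non-mutating counterparts of item assignment and del.
with_rank = arr.assign_coords(ranking=("space", [1, 2, 3]))
without_rank = with_rank.drop_vars("ranking")
```

Returning new objects makes these calls easy to chain and leaves arr untouched.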

For more details, see Coordinates.
