What is the Scala case class equivalent in PySpark?

申屠项明

2023-03-14

Question:

How would you go about employing and/or implementing case class equivalents in PySpark?

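Answer:

As mentioned by Alex Hall, the real equivalent of a named product type is a namedtuple.

Unlike Row, suggested in the other answer, it has a number of useful properties:

It has a well-defined shape and can be reliably used for structural pattern matching: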
```
>>> from collections import namedtuple
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
```

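In contrast, Rows are not reliable when used with keyword arguments (before Spark 3.0, keyword fields are sorted alphabetically):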

    >>> from pyspark.sql import Row
    >>>
    >>> foobar = Row(foo=42, bar=-42)
    >>> foo, bar = foobar
    >>> foo
    -42
    >>> bar
    42

although if defined with positional arguments:

    >>> FooBar = Row("foo", "bar")
    >>> foobar = FooBar(42, -42)
    >>> foo, bar = foobar
    >>> foo
    42
    >>> bar
    -42

the order is preserved.

It defines proper types:
```
>>> from functools import singledispatch
>>>
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> type(FooBar)
<class 'type'>
>>> isinstance(FooBar(42, -42), FooBar)
True
```

and can be used wherever type handling is required, especially with single dispatch:

    >>> import math
    >>>
    >>> Circle = namedtuple("Circle", ["x", "y", "r"])
    >>> Rectangle = namedtuple("Rectangle", ["x1", "y1", "x2", "y2"])
    >>>
    >>> @singledispatch
    ... def area(x):
    ...     raise NotImplementedError
    ...
    >>> @area.register(Rectangle)
    ... def _(x):
    ...     return abs(x.x1 - x.x2) * abs(x.y1 - x.y2)
    ...
    >>> @area.register(Circle)
    ... def _(x):
    ...     return math.pi * x.r ** 2
    ...
    >>> area(Rectangle(0, 0, 4, 4))
    16
    >>> area(Circle(0, 0, 4))
    50.26548245743669

and multiple dispatch:

    >>> from multipledispatch import dispatch
    >>> from numbers import Rational
    >>>
    >>> @dispatch(Rectangle, Rational)
    ... def scale(x, y):
    ...     return Rectangle(x.x1, x.y1, x.x2 * y, x.y2 * y)
    ...
    >>> @dispatch(Circle, Rational)
    ... def scale(x, y):
    ...     return Circle(x.x, x.y, x.r * y)
    ...
    >>> scale(Rectangle(0, 0, 4, 4), 2)
    Rectangle(x1=0, y1=0, x2=8, y2=8)
    >>> scale(Circle(0, 0, 11), 2)
    Circle(x=0, y=0, r=22)

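Combined with the first property, this means namedtuples can be used in a broad range of pattern matching scenarios. namedtuples also support standard inheritance and type hints.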
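As a rough sketch of the inheritance and type-hint support, the standard library's `typing.NamedTuple` lets the same kind of shape be declared with annotations. The `area` method, the `UnitCircle` subclass and its `at` constructor are illustrative assumptions, not part of the original answer.

```python
import math
from typing import NamedTuple


class Circle(NamedTuple):
    """Typed counterpart of namedtuple("Circle", ["x", "y", "r"])."""
    x: float
    y: float
    r: float

    def area(self) -> float:
        # Methods can be defined directly in the class body.
        return math.pi * self.r ** 2


class UnitCircle(Circle):
    """Hypothetical subclass; __slots__ = () keeps instances dict-free."""
    __slots__ = ()

    @classmethod
    def at(cls, x: float, y: float) -> "UnitCircle":
        return cls(x, y, 1.0)


c = UnitCircle.at(0.0, 0.0)
x, y, r = c  # instances still unpack positionally
print(isinstance(c, Circle), c.area())
```

Because `UnitCircle` is an ordinary subclass, anything registered for `Circle` (for example a `singledispatch` handler) also accepts it.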

Rows don't:

    >>> FooBar = Row("foo", "bar")
    >>> type(FooBar)
    <class 'pyspark.sql.types.Row'>
    >>> isinstance(FooBar(42, -42), FooBar)  # Expected failure
    Traceback (most recent call last):
    ...
    TypeError: isinstance() arg 2 must be a type or tuple of types
    >>> BarFoo = Row("bar", "foo")
    >>> isinstance(FooBar(42, -42), type(BarFoo))
    True
    >>> isinstance(BarFoo(42, -42), type(FooBar))
    True

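It provides a highly optimized representation. Unlike Row objects, tuples don't use __dict__ or carry field names with each instance. As a result, they are considerably faster to initialize (roughly 3 to 7 times in the timings below):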
```
>>> FooBar = namedtuple("FooBar", ["foo", "bar"])
>>> %timeit FooBar(42, -42)
587 ns ± 5.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

compared to the different Row constructors:

    >>> %timeit Row(foo=42, bar=-42)
    3.91 µs ± 7.67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    >>> FooBar = Row("foo", "bar")
    >>> %timeit FooBar(42, -42)
    2 µs ± 25.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and are more memory efficient (a very important property when working with large-scale data):

    >>> import sys
    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> sys.getsizeof(FooBar(42, -42))
    64

compared to the equivalent Row:

    >>> sys.getsizeof(Row(foo=42, bar=-42))
    72
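To illustrate where the saving comes from, the sketch below contrasts a namedtuple with a hand-written class that stores its fields per instance. `PlainFooBar` is a hypothetical stand-in used only for contrast; it is not how `Row` is implemented, and exact byte counts vary across Python versions.

```python
import sys
from collections import namedtuple

FooBar = namedtuple("FooBar", ["foo", "bar"])


class PlainFooBar:
    """Hypothetical ordinary class: every instance carries its own __dict__."""

    def __init__(self, foo, bar):
        self.foo = foo
        self.bar = bar


nt = FooBar(42, -42)
plain = PlainFooBar(42, -42)

# Field names are stored once on the class, not on each instance.
print(FooBar._fields)
print(hasattr(nt, "__dict__"), hasattr(plain, "__dict__"))

# Per-instance footprint: the plain object also pays for its attribute dict.
nt_size = sys.getsizeof(nt)
plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
print(nt_size, plain_size)
```

Note that `sys.getsizeof` is shallow: it does not count the field values themselves, which are the same objects in both cases.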

Finally, attribute access is much faster with namedtuple:

    >>> FooBar = namedtuple("FooBar", ["foo", "bar"])
    >>> foobar = FooBar(42, -42)
    >>> %timeit foobar.foo
    102 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

compared to the equivalent operation on a Row object:

    >>> foobar = Row(foo=42, bar=-42)
    >>> %timeit foobar.foo
    2.58 µs ± 26.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Last but not least, namedtuples are properly supported in Spark SQL:
```
>>> Record = namedtuple("Record", ["id", "name", "value"])
>>> spark.createDataFrame([Record(1, "foo", 42)])
DataFrame[id: bigint, name: string, value: bigint]
```

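Summary:

It should be clear that Row is a very poor substitute for an actual product type, and should be avoided unless enforced by the Spark API.

It should also be clear that pyspark.sql.Row is not intended to be a replacement for a case class, once you consider that it is the direct equivalent of org.apache.spark.sql.Row, a type which is pretty far from an actual product and behaves like Seq[Any] (depending on the subclass, with names added). Both the Python and Scala implementations were introduced as a useful, albeit awkward, interface between external code and the internal Spark SQL representation.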

See also:

It would be a shame not to mention the excellent MacroPy developed by Li Haoyi, and its port by Alberto Berti:
```
>>> import macropy.console
0=[]=====> MacroPy Enabled <=====[]=0
>>> from macropy.case_classes import macros, case
>>> @case
... class FooBar(foo, bar): pass
...
>>> foobar = FooBar(42, -42)
>>> foo, bar = foobar
>>> foo
42
>>> bar
-42
```

It comes with a rich set of other features including, but not limited to, advanced pattern matching and a neat lambda expression syntax.

Python dataclasses (Python 3.7+).
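For comparison, a minimal sketch of the dataclass approach: a frozen dataclass gives value equality, immutability and a `replace` helper that resembles Scala's `copy`, which covers much of what a case class offers. The class and field names are illustrative assumptions.

```python
from dataclasses import FrozenInstanceError, astuple, dataclass, replace


@dataclass(frozen=True)
class FooBar:
    foo: int
    bar: int


foobar = FooBar(42, -42)

# Value equality and a readable repr are generated automatically.
print(foobar == FooBar(42, -42), foobar)

# replace() builds a modified copy, much like copy() on a case class.
updated = replace(foobar, bar=0)

# Dataclasses are not tuples, so convert explicitly before unpacking.
foo, bar = astuple(foobar)

# Frozen instances reject attribute assignment.
try:
    foobar.foo = 1
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Unlike a namedtuple, a dataclass instance is not iterable and carries a per-instance `__dict__` by default, so the shape and memory properties discussed above do not carry over automatically.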
