Question:

Number of unique elements per row in a NumPy array

别锐

2023-03-14

For example, for

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

I would like to get

[2, 2, 3]

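Is there a way to do this without a for loop or np.vectorize?

Edit: The actual data consists of 1000 rows of 100 elements each, with every element in the range 1 to 365. The end goal is to determine the percentage of rows that contain duplicates. This was a homework problem that I have already solved (with a for loop), but I am curious whether there is a better way to do it with NumPy.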

3 answers

孙乐逸

2023-03-14

Would you consider pandas? DataFrame has a dedicated method for this:

>>> import numpy as np
>>> import pandas as pd
>>> a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
>>> df = pd.DataFrame(a.T)
>>> print(*df.nunique())
2 2 3

郎鸿

2023-03-14

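This solution uses np.apply_along_axis, which is not vectorized and involves a Python-level loop. It is fairly intuitive, though, since it just combines the len and np.unique functions.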

import numpy as np
from toolz import compose

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

np.apply_along_axis(compose(len, np.unique), 1, a)    # [2, 2, 3]
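If pulling in toolz only for compose is not desirable, the same call can probably be written with a plain lambda. This is just a sketch of an equivalent formulation; it is still a Python-level loop over the rows, so it should not be expected to run any faster.

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

# Same idea as compose(len, np.unique), without the toolz dependency
counts = np.apply_along_axis(lambda row: len(np.unique(row)), 1, a)
print(counts)  # [2 2 3]
```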

龚承嗣

2023-03-14

Approach #1

A vectorized approach based on sorting:

In [8]: b = np.sort(a,axis=1)

In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
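Applied to the end goal stated in the question (the percentage of rows containing duplicates), the sorting approach could be used roughly as follows. The random array here is only an assumed stand-in for the real 1000 x 100 data, and the variable names are illustrative.

```python
import numpy as np

# Assumed stand-in for the real data: 1000 rows, 100 values each in 1..365
rng = np.random.default_rng(0)
data = rng.integers(1, 366, size=(1000, 100))

# Sort each row, then count the positions where neighbours differ
b = np.sort(data, axis=1)
n_unique = (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1

# A row has a duplicate exactly when it has fewer unique values than columns
has_dup = n_unique < data.shape[1]
pct_with_dup = 100.0 * has_dup.mean()
print(pct_with_dup)
```

Taking the mean of the boolean mask gives the fraction of rows with at least one repeated value, so no explicit loop over rows is needed.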

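Approach #2

For ints that are not very large, another approach is to add a per-row offset so that the elements of each row cannot collide with those of any other row, then do a binned count over the flattened array and count the number of non-zero bins per row: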

n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
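To make the offset trick easier to follow, here is the same computation on the small array from the question, with the intermediate arrays written out as comments. The name counts is introduced only for this illustration.

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

n = a.max() + 1                                  # 5 possible values: 0..4
a_off = a + np.arange(a.shape[0])[:, None] * n   # row i is shifted by i*n
# a_off:
# [[ 1  0  0]
#  [ 6  5  5]
#  [12 13 14]]

counts = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)
# counts (one row of n bins per input row):
# [[2 1 0 0 0]
#  [2 1 0 0 0]
#  [0 0 1 1 1]]

out = (counts != 0).sum(axis=1)
print(out)  # [2 2 3]
```

Note that np.bincount only accepts non-negative integers, and the intermediate count array has number-of-rows times n entries, which is why this approach suits small value ranges.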

Approaches as functions (for benchmarking):

def sorting(a):
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)

Case #1: Square array

In [164]: np.random.seed(0)

In [165]: a = np.random.randint(0,5,(10000,10000))

In [166]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop

Case #2: Large number of rows

In [167]: np.random.seed(0)

In [168]: a = np.random.randint(0,5,(1000000,10))

In [169]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop

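Extending to the number of unique elements per column

To extend, we just need to do the slicing and ufunc operations along the other axis for the two proposed approaches, like so: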

def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

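Let's see how to extend this to arrays of any number of dimensions and get the unique counts along an arbitrary axis. We will use np.diff with its axis parameter to get the consecutive differences, which makes the function generic: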

def nunique(a, axis):
    return (np.diff(np.sort(a,axis=axis),axis=axis)!=0).sum(axis=axis)+1

Sample runs:

In [77]: a
Out[77]: 
array([[1, 0, 2, 2, 0],
       [1, 0, 1, 2, 0],
       [0, 0, 0, 0, 2],
       [1, 2, 1, 0, 1],
       [2, 0, 1, 0, 0]])

In [78]: nunique(a, axis=0)
Out[78]: array([3, 2, 3, 2, 3])

In [79]: nunique(a, axis=1)
Out[79]: array([3, 3, 2, 3, 3])
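As a sanity check on the claim that this works for any number of dimensions, the function can be compared against a slower np.unique-based reference on a 3-D array. This is only a sketch; the test array a3 and the reference computation are assumptions made for illustration.

```python
import numpy as np

def nunique(a, axis):
    return (np.diff(np.sort(a, axis=axis), axis=axis) != 0).sum(axis=axis) + 1

rng = np.random.default_rng(0)
a3 = rng.integers(0, 4, size=(3, 4, 5))

for axis in range(a3.ndim):
    # Reference: apply np.unique to every 1-D slice along `axis`
    expected = np.apply_along_axis(lambda v: len(np.unique(v)), axis, a3)
    assert np.array_equal(nunique(a3, axis), expected)
```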

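If you are working with floating-point numbers and want to decide uniqueness based on some tolerance value rather than exact matching, we can use np.isclose. Two such options are: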

(~np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0)).sum(axis)+1
a.shape[axis]-np.isclose(np.diff(np.sort(a,axis=axis),axis=axis),0).sum(axis)

For a custom tolerance value, pass it to np.isclose through its rtol and atol parameters.
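A minimal sketch of the tolerance-based variant, wrapped in a function. The name nunique_close and its atol default are hypothetical choices for this example. Because the differences are compared against 0, the rtol term inside np.isclose is multiplied by 0, so only atol has any effect here.

```python
import numpy as np

def nunique_close(a, axis, atol=1e-8):
    # Hypothetical helper: neighbours in sorted order that differ by at most
    # atol are counted as the same value
    d = np.diff(np.sort(a, axis=axis), axis=axis)
    return a.shape[axis] - np.isclose(d, 0, rtol=0, atol=atol).sum(axis=axis)

x = np.array([[1.0, 1.0 + 1e-9, 2.0],
              [0.1 + 0.2, 0.3, 0.5]])

print(nunique_close(x, axis=1))              # [2 2]
print(nunique_close(x, axis=1, atol=1e-12))  # [3 2]
```

One caveat of comparing only adjacent sorted values: a chain of values that are each within the tolerance of their neighbour is counted as a single value, even if the first and last are further apart than the tolerance.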
