Why getLength() and getBytes().length Return Different Lengths for Text Data

束向荣

2023-12-01

Let's start with a small demo:

Text t = new Text("hadoop");

t.set("pig");

System.out.println(t.getLength());

System.out.println(t.getBytes().length);

As expected, both lines print 3. But what happens if we change the argument passed to set to the following type?

t.set(new Text("pig"));

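Run the program again and the result changes: it now prints 3 and 6. What is going on? Let's look at the source code for an answer. We start by tracing the set overload that takes a Text parameter: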

/** copy a text. */

public void set(Text other) {

    set(other.getBytes(), 0, other.getLength());

}

Following the call into the next set method:

public void set(byte[] utf8, int start, int len) {

    setCapacity(len, false);

    System.arraycopy(utf8, start, bytes, 0, len);
    
    this.length = len;

}

Let's examine this set method more closely, starting with setCapacity:

/*

* Sets the capacity of this Text object to <em>at least</em>

* <code>len</code> bytes. If the current buffer is longer,

* then the capacity and existing content of the buffer are

* unchanged. If <code>len</code> is larger

* than the current capacity, the Text object's capacity is

* increased to match.

* @param len the number of bytes we need

* @param keepData should the old data be kept

*/

private void setCapacity(int len, boolean keepData) {

    if (bytes == null || bytes.length < len) {

        if (bytes != null && keepData) {

            bytes = Arrays.copyOf(bytes, Math.max(len,length << 1));

        } else {

            bytes = new byte[len];

        }

    }

}

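The method comment makes the behavior clear: when the Text object's current buffer is already at least len bytes long, neither the buffer's capacity nor its contents change. Only when len exceeds the current buffer length is the Text object's capacity (that is, bytes.length, not the length field) increased.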

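So when we first create the object with Text t = new Text("hadoop");, the buffer holds hadoop and is 6 bytes long. When we then call t.set(new Text("pig")); to change the value, the method above is invoked with len = 3. Since the buffer (6 bytes) is longer than the new Text's length (3 bytes), the buffer is left untouched.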

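Next, System.arraycopy(utf8, start, bytes, 0, len); is executed. This is a native JDK method; here it copies len bytes from utf8 into bytes (the buffer discussed above), leaving the remaining bytes of the buffer unchanged.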

For example, copying pig onto hadoop yields pigoop.
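The overwrite can be reproduced with the JDK alone. The following is a minimal sketch, not Hadoop code; the class name ArrayCopyDemo and the helper overwritePrefix are made up for illustration, and it assumes the new value is no longer than the old one.

```java
import java.nio.charset.StandardCharsets;

public class ArrayCopyDemo {
    // Copies the bytes of newValue over the start of a buffer that holds oldValue,
    // then decodes the whole buffer (hypothetical helper, not part of Hadoop).
    public static String overwritePrefix(String oldValue, String newValue) {
        byte[] buffer = oldValue.getBytes(StandardCharsets.UTF_8);
        byte[] src = newValue.getBytes(StandardCharsets.UTF_8);
        // Assumes src.length <= buffer.length, as in the "pig" over "hadoop" case.
        System.arraycopy(src, 0, buffer, 0, src.length);
        return new String(buffer, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(overwritePrefix("hadoop", "pig")); // prints pigoop
    }
}
```

Only the first three bytes are replaced; the trailing oop is left over from the old value.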

Finally, this.length = len; updates the Text object's length to 3.
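To see the three steps together without a Hadoop dependency, here is a small stand-in class. MiniText is hypothetical and mirrors only the set(Text) and set(byte[], int, int) logic quoted above; its constructor simply uses String.getBytes, which is an assumption for the sketch and not how Text builds its buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Text; it reproduces only the buffer-reuse logic.
public class MiniText {
    private byte[] bytes;
    private int length;

    public MiniText(String s) {
        // Assumption for this sketch: start with an exactly sized buffer.
        bytes = s.getBytes(StandardCharsets.UTF_8);
        length = bytes.length;
    }

    public void set(MiniText other) {
        set(other.bytes, 0, other.length);
    }

    public void set(byte[] utf8, int start, int len) {
        // Same effect as setCapacity(len, false): reallocate only if too small.
        if (bytes == null || bytes.length < len) {
            bytes = new byte[len];
        }
        System.arraycopy(utf8, start, bytes, 0, len);
        length = len;
    }

    public int getLength() { return length; }

    public byte[] getBytes() { return bytes; }

    public static void main(String[] args) {
        MiniText t = new MiniText("hadoop");
        t.set(new MiniText("pig"));
        System.out.println(t.getLength());       // prints 3
        System.out.println(t.getBytes().length); // prints 6
    }
}
```

The 6-byte buffer survives the call because it is large enough, so only the length field shrinks.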

Now let's revisit the two key calls in the demo's output:

System.out.println(t.getLength());

System.out.println(t.getBytes().length);

For t.getLength(), stepping into the code shows that it simply returns the Text object's length field, which is 3.

/** Returns the number of bytes in the byte array */

@Override

public int getLength() {

    return length;

}

For t.getBytes().length, we first obtain the Text object's bytes array, which holds pigoop, and then take its length, which is 6.
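In practice this means the array returned by getBytes() should be read only up to getLength(). The sketch below shows that idea in plain Java; ValidBytesDemo and its helpers are hypothetical names, and the buffer and length values stand in for what getBytes() and getLength() returned in the demo. Hadoop's Text also has a copyBytes() method intended for this purpose, though it is worth confirming that your Hadoop version includes it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ValidBytesDemo {
    // Returns a copy holding only the first `length` bytes of the buffer.
    public static byte[] validBytes(byte[] buffer, int length) {
        return Arrays.copyOf(buffer, length);
    }

    // Decodes only the valid prefix of the buffer as UTF-8.
    public static String decode(byte[] buffer, int length) {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] buffer = "pigoop".getBytes(StandardCharsets.UTF_8); // stands in for getBytes()
        int length = 3;                                            // stands in for getLength()
        System.out.println(validBytes(buffer, length).length); // prints 3
        System.out.println(decode(buffer, length));            // prints pig
    }
}
```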

But this raises a question: why were both lengths 3 when we used t.set("pig");? Clearly that set overload is implemented differently. Let's take a look:

public void set(String string) {

    try {

        ByteBuffer bb = encode(string, true);

        bytes = bb.array();

        length = bb.limit();

    } catch (CharacterCodingException e) {

        throw new RuntimeException("Should not have happened ", e);

    }

}

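The first statement converts the given string into a ByteBuffer using UTF-8 (the Javadoc of encode reads: "Converts the provided String to bytes using the UTF-8 encoding."). The buffer's backing array, obtained via array(), then replaces bytes, and the value returned by limit() is assigned to length. Because bytes now points to a new array created for the new string instead of the reused old buffer, no stale data from hadoop remains, and in this demo both t.getLength() and t.getBytes().length return 3. More generally, only the first getLength() bytes of the array returned by getBytes() should be treated as valid data.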
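The roles of array() and limit() can be seen with the JDK's own encoder. This is a rough approximation of what encode does, under the assumption that it wraps the string in a CharBuffer and runs a UTF-8 CharsetEncoder; EncodeDemo and encodeUtf8 are hypothetical names, and the error-replacement behavior of Text's encode is not reproduced.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Encodes a string to UTF-8 and returns the resulting heap ByteBuffer.
    public static ByteBuffer encodeUtf8(String s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new RuntimeException("Should not have happened", e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer bb = encodeUtf8("pig");
        System.out.println(bb.limit());        // number of valid bytes: 3
        System.out.println(bb.array().length); // backing array size, never smaller than limit()
    }
}
```

limit() is the authoritative count of encoded bytes, which is why set(String) stores it in length instead of relying on the array size.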
