Notes on Hive's ObjectInspector Interface

汪和悌
2023-12-01

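Contents

1. ObjectInspector interface source code:

2. ObjectInspector interface comments:

3. Category in the ObjectInspector interface:

4. The getTypeName() method of the ObjectInspector interface:

5. The getCategory() method of the ObjectInspector interface:

6. Creating ObjectInspector instances with factory methods:

7. Parsing Object data with an ObjectInspector: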


------ These notes were compiled from reading the source code and its English comments.

1. ObjectInspector interface source code:

package org.apache.hadoop.hive.serde2.objectinspector;

/**
 * ObjectInspector helps us to look into the internal structure of a complex
 * object.
 *
 * A (probably configured) ObjectInspector instance stands for a specific type
 * and a specific way to store the data of that type in the memory.
 *
 * For native java Object, we can directly access the internal structure through
 * member fields and methods. ObjectInspector is a way to delegate that
 * functionality away from the Object, so that we have more control on the
 * behavior of those actions.
 *
 * An efficient implementation of ObjectInspector should rely on factory, so
 * that we can make sure the same ObjectInspector only has one instance. That
 * also makes sure hashCode() and equals() methods of java.lang.Object directly
 * works for ObjectInspector as well.
 */
public interface ObjectInspector extends Cloneable {

  /**
   * Category.
   *
   */
  public static enum Category {
    PRIMITIVE, LIST, MAP, STRUCT, UNION
  };

  /**
   * Returns the name of the data type that is inspected by this
   * ObjectInspector. This is used to display the type information to the user.
   *
   * For primitive types, the type name is standardized. For other types, the
   * type name can be something like "list<int>", "map<int,string>", java class
   * names, or user-defined type names similar to typedef.
   */
  String getTypeName();

  /**
   * An ObjectInspector must inherit from one of the following interfaces if
   * getCategory() returns: PRIMITIVE: PrimitiveObjectInspector LIST:
   * ListObjectInspector MAP: MapObjectInspector STRUCT: StructObjectInspector.
   */
  Category getCategory();
}

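2. ObjectInspector interface comments:

1) ObjectInspector helps us look into the internal structure of a complex object.

2) A (probably configured) ObjectInspector instance stands for a specific type and a specific way of storing data of that type in memory.

3) For a native Java object, we can access the internal structure directly through member fields and methods. ObjectInspector is a way to delegate that functionality away from the object, so that we have more control over how those actions behave.

4) An efficient implementation of ObjectInspector should rely on a factory, so that the same ObjectInspector has only one instance. This also ensures that the hashCode() and equals() methods inherited from java.lang.Object work directly for ObjectInspector.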
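Point 3) can be illustrated without any Hive dependency. The sketch below is a simplified, hypothetical model (the names `MiniLongInspector`, `BOXED`, and `TEXT` are invented for illustration and are not Hive classes): two inspectors stand for the same logical type, long, stored in two different ways, and the caller reads values only through the inspector.

```java
public class InspectorSketch {

    /** Hypothetical stand-in for a primitive inspector: reads a long from an opaque Object. */
    interface MiniLongInspector {
        long get(Object o);
    }

    /** Storage layout 1: the value is held as a java.lang.Long. */
    static final MiniLongInspector BOXED = o -> (Long) o;

    /** Storage layout 2: the value is held as decimal text. */
    static final MiniLongInspector TEXT = o -> Long.parseLong(o.toString());

    /** Sums rows without knowing how they are stored; the inspector supplies that knowledge. */
    static long sum(Object[] rows, MiniLongInspector inspector) {
        long total = 0;
        for (Object row : rows) {
            total += inspector.get(row);
        }
        return total;
    }

    public static void main(String[] args) {
        Object[] boxedRows = {1L, 2L, 3L};
        Object[] textRows = {"1", "2", "3"};
        System.out.println(sum(boxedRows, BOXED)); // prints 6
        System.out.println(sum(textRows, TEXT));   // prints 6
    }
}
```

The `sum` method never changes when the storage format does; only the inspector passed in changes, which is the kind of control the Javadoc describes.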

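3. Category in the ObjectInspector interface:

The enum Category defines five kinds of type: primitive (PRIMITIVE), list (LIST), key-value map (MAP), struct (STRUCT), and union (UNION).

The primitive kind covers the primitive types that Hive supports: byte, short, int, long, float, double, string, boolean, and so on.

(The full list appears in the PrimitiveCategory enum of the PrimitiveObjectInspector interface, excerpted below:)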

package org.apache.hadoop.hive.serde2.objectinspector;

public interface PrimitiveObjectInspector extends ObjectInspector {

  /**
   * The primitive types supported by Hive.
   */
  public static enum PrimitiveCategory {
    VOID, BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING,
    DATE, TIMESTAMP, BINARY, DECIMAL, VARCHAR, CHAR, INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME,
    UNKNOWN
  };

}

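4. The getTypeName() method of the ObjectInspector interface:

1) Returns the name of the data type inspected by this ObjectInspector; it is used to display type information to the user.

2) For primitive types, the type name is standardized. For other types, the type name can be something like "list<int>" or "map<int,string>", a Java class name, or a user-defined type name.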
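The composite names quoted in the Javadoc nest naturally, because each composite name is built from the names of its element types. The following is a minimal sketch of that string pattern only; the helper names `listOf` and `mapOf` are hypothetical, and how Hive's own inspectors assemble their names is not shown in these notes.

```java
public class TypeNameSketch {

    /** Builds a list type name from its element type name. */
    static String listOf(String elementTypeName) {
        return "list<" + elementTypeName + ">";
    }

    /** Builds a map type name from its key and value type names. */
    static String mapOf(String keyTypeName, String valueTypeName) {
        return "map<" + keyTypeName + "," + valueTypeName + ">";
    }

    public static void main(String[] args) {
        System.out.println(listOf("int"));                  // prints list<int>
        System.out.println(mapOf("int", "string"));         // prints map<int,string>
        System.out.println(listOf(mapOf("int", "string"))); // prints list<map<int,string>>
    }
}
```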

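5. The getCategory() method of the ObjectInspector interface:

1) If getCategory() returns PRIMITIVE, the ObjectInspector implementation must inherit from the PrimitiveObjectInspector interface;

2) If getCategory() returns LIST, the ObjectInspector implementation must inherit from the ListObjectInspector interface;

3) If getCategory() returns MAP, the ObjectInspector implementation must inherit from the MapObjectInspector interface;

4) If getCategory() returns STRUCT, the ObjectInspector implementation must inherit from the StructObjectInspector interface.

These sub-interfaces define more detailed, type-specific operations on the data.

(They live in the package org.apache.hadoop.hive.serde2.objectinspector.)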
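Because the category determines which sub-interface an inspector implements, calling code typically switches on getCategory() and then casts. The sketch below shows that pattern with a reduced, hypothetical model (a two-value `Category` enum and the invented interfaces `Inspector`, `PrimitiveInspector`, and `ListInspector`); it mirrors the contract described above but does not use Hive's classes.

```java
import java.util.Arrays;
import java.util.List;

public class CategoryDispatchSketch {

    /** Hypothetical subset of the categories, enough to show the dispatch. */
    enum Category { PRIMITIVE, LIST }

    interface Inspector {
        Category getCategory();
    }

    interface PrimitiveInspector extends Inspector {
        Object getPrimitive(Object o);
    }

    interface ListInspector extends Inspector {
        Inspector getElementInspector();
        List<?> getList(Object o);
    }

    /** Inspector for values stored as java.lang.Long. */
    static final PrimitiveInspector LONG = new PrimitiveInspector() {
        public Category getCategory() { return Category.PRIMITIVE; }
        public Object getPrimitive(Object o) { return (Long) o; }
    };

    /** Inspector for values stored as java.util.List, with the given element inspector. */
    static ListInspector listOf(final Inspector element) {
        return new ListInspector() {
            public Category getCategory() { return Category.LIST; }
            public Inspector getElementInspector() { return element; }
            public List<?> getList(Object o) { return (List<?>) o; }
        };
    }

    /** Renders a value by switching on the category, then casting to the matching sub-interface. */
    static String render(Object o, Inspector oi) {
        switch (oi.getCategory()) {
            case PRIMITIVE:
                return String.valueOf(((PrimitiveInspector) oi).getPrimitive(o));
            case LIST: {
                ListInspector loi = (ListInspector) oi;
                List<?> items = loi.getList(o);
                StringBuilder sb = new StringBuilder("[");
                for (int i = 0; i < items.size(); i++) {
                    if (i > 0) sb.append(",");
                    sb.append(render(items.get(i), loi.getElementInspector()));
                }
                return sb.append("]").toString();
            }
            default:
                throw new IllegalArgumentException("Unhandled category: " + oi.getCategory());
        }
    }

    public static void main(String[] args) {
        Object nested = Arrays.asList(Arrays.asList(1L, 2L), Arrays.asList(3L));
        System.out.println(render(nested, listOf(listOf(LONG)))); // prints [[1,2],[3]]
    }
}
```

The cast in each branch is safe only because of the rule stated above: an inspector that reports a category must implement the corresponding sub-interface.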

In addition, the package org.apache.hadoop.hive.serde2.objectinspector.primitive provides finer-grained interfaces for the individual primitive types, such as:

package org.apache.hadoop.hive.serde2.objectinspector.primitive;

import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;

/**
 * A LongObjectInspector inspects an Object representing a Long.
 */
public interface LongObjectInspector extends PrimitiveObjectInspector {

  /**
   * Get the long data.
   */
  long get(Object o);
}

package org.apache.hadoop.hive.serde2.objectinspector.primitive;

import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.io.Text;

/**
 * A StringObjectInspector inspects an Object representing a String.
 */
public interface StringObjectInspector extends PrimitiveObjectInspector {

  /**
   * Get the Text representation of the data.
   */
  Text getPrimitiveWritableObject(Object o);

  /**
   * Get the String representation of the data.
   */
  String getPrimitiveJavaObject(Object o);
}

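6. Creating ObjectInspector instances with factory methods:

1) OI instances for primitive types are created by the factory class PrimitiveObjectInspectorFactory;

    (the factory class is in the package org.apache.hadoop.hive.serde2.objectinspector.primitive)

1.1) The factory holds statically cached Writable-based and Java-based ObjectInspectors for each primitive type, for example: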

// Writable-based: inspects Hive's Writable types
public static final WritableLongObjectInspector writableLongObjectInspector =
      new WritableLongObjectInspector();
// Java-based: inspects plain Java types
public static final JavaLongObjectInspector javaLongObjectInspector =
      new JavaLongObjectInspector();

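Note: caching is possible because an ObjectInspector has no internal state. Its purpose is to guarantee that requests with the same construction parameters return the very same ObjectInspector instance.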
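The caching idea in the note can be sketched with a map keyed by the construction parameters. This is a dependency-free illustration under stated assumptions: `MiniListInspector` and `getListInspector` are hypothetical names, a plain string stands in for the element inspector, and Hive's actual cache implementation is not shown in these notes.

```java
import java.util.concurrent.ConcurrentHashMap;

public class InspectorCacheSketch {

    /** Hypothetical stateless inspector, fully described by its element type name. */
    static final class MiniListInspector {
        final String elementTypeName;

        private MiniListInspector(String elementTypeName) {
            this.elementTypeName = elementTypeName;
        }
    }

    /** One cached instance per distinct construction parameter. */
    private static final ConcurrentHashMap<String, MiniListInspector> CACHE =
            new ConcurrentHashMap<>();

    /** Factory method: creates the inspector on first request, returns the cached one afterwards. */
    static MiniListInspector getListInspector(String elementTypeName) {
        return CACHE.computeIfAbsent(elementTypeName, MiniListInspector::new);
    }

    public static void main(String[] args) {
        MiniListInspector a = getListInspector("bigint");
        MiniListInspector b = getListInspector("bigint");
        MiniListInspector c = getListInspector("double");
        System.out.println(a == b);      // prints true: same parameters, same instance
        System.out.println(a == c);      // prints false: different parameters
        System.out.println(a.equals(b)); // prints true: the default Object.equals() suffices
    }
}
```

Since the factory is the only way to obtain an instance, identity comparison and the inherited equals() and hashCode() agree, which is the property the interface Javadoc relies on.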

1.2) Ways to obtain an instance:

// Get a Writable-based OI instance from a PrimitiveCategory enum value
getPrimitiveWritableObjectInspector(PrimitiveCategory primitiveCategory);
// Get a Java-based OI instance from a PrimitiveCategory enum value
getPrimitiveJavaObjectInspector(PrimitiveCategory primitiveCategory);

// Get a Writable-based OI instance from a typeInfo
getPrimitiveWritableObjectInspector(PrimitiveTypeInfo typeInfo);
// Get a Java-based OI instance from a typeInfo
getPrimitiveJavaObjectInspector(PrimitiveTypeInfo typeInfo);

// Get the matching OI instance from a Hive Writable class or a Java primitive class
getPrimitiveObjectInspectorFromClass(Class<?> c);

// Get an OI instance for a constant value
getPrimitiveWritableConstantObjectInspector(PrimitiveTypeInfo typeInfo, Object value);

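2) OI instances for the other categories are created by the factory class ObjectInspectorFactory.

    (the factory class is in the package org.apache.hadoop.hive.serde2.objectinspector)

2.1) The factory provides static caches and factory methods for ObjectInspectors of these other categories, for example: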

// Get a List OI instance (non-constant)
getStandardListObjectInspector(ObjectInspector listElementObjectInspector);
// Get a constant List OI instance; the constant List value must be supplied
getStandardConstantListObjectInspector(ObjectInspector listElementObjectInspector, List<?> constantValue);

// Get a Map OI instance (non-constant)
getStandardMapObjectInspector(
      ObjectInspector mapKeyObjectInspector,
      ObjectInspector mapValueObjectInspector);
// Get a constant Map OI instance
getStandardConstantMapObjectInspector(
      ObjectInspector mapKeyObjectInspector,
      ObjectInspector mapValueObjectInspector,
      Map<?, ?> constantValue);

// Get a Union OI instance (non-constant)
getStandardUnionObjectInspector(
      List<ObjectInspector> unionObjectInspectors);

// Get a Struct OI instance (non-constant)
getStandardStructObjectInspector(
      List<String> structFieldNames,
      List<ObjectInspector> structFieldObjectInspectors);
// Get a Struct OI instance (non-constant), with field comments
getStandardStructObjectInspector(
      List<String> structFieldNames,
      List<ObjectInspector> structFieldObjectInspectors,
      List<String> structComments);
// Get a constant Struct OI instance
getStandardConstantStructObjectInspector(
    List<String> structFieldNames,
    List<ObjectInspector> structFieldObjectInspectors,  List<?> value);
// Get a UnionStruct OI instance (built from several Struct OIs)
getUnionStructObjectInspector(
      List<StructObjectInspector> structObjectInspectors);

// Get a ColumnarStruct OI instance (a Struct OI for columnar data)
getColumnarStructObjectInspector(
      List<String> structFieldNames,
      List<ObjectInspector> structFieldObjectInspectors);
// Get a ColumnarStruct OI instance, with field comments
getColumnarStructObjectInspector(
      List<String> structFieldNames,
      List<ObjectInspector> structFieldObjectInspectors, List<String> structFieldComments);

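7. Parsing Object data with an ObjectInspector:

Key point: an ObjectInspector object holds no data itself. It only describes how the data is stored and acts as a uniform manager, or proxy, for operations on the data object.

A few examples follow: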

package com.hive.udaf;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;

public class TestOI {
	
	public static void main(String[] args) {
		
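		// Test: read primitive data through a primitive ObjectInspector instance
		testPrimitiveOI();
		// Test: read primitive data through a Struct ObjectInspector instance
		testStructOI();
		// Test: read primitive data through a List ObjectInspector instance
		testListOI();
		// Test: read primitive data through a Map ObjectInspector instance
		testMapOI();
		// Test: read primitive data through a Union ObjectInspector instance
		// testUnionOI(); // TODO: not implemented yet; the UNION type still needs investigating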
	}
	
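	// Test: read primitive data through a primitive ObjectInspector instance
	public static void testPrimitiveOI() {
		// Part 1: one-time initialization of the inspectors
		// 1. Get the Java Double OI instance from the primitive factory (a Hive Writable OI would also work)
		PrimitiveObjectInspector poi = 
				PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(
						PrimitiveCategory.DOUBLE);
		
		// Part 2: for each incoming Object, extract the primitive value according to the OI set up in Part 1
		// 2. Use the Double value 1.0 as the input Object
		Object p = Double.valueOf(1.0);
		
		// 3. Read the data from the Object with the primitive utility class, using the type described by poi
		double value = PrimitiveObjectInspectorUtils.getDouble(p, poi);
		// 3. Alternatively, read the value through the primitive OI instance itself
		double value1 = ((DoubleObjectInspector) poi).get(p);
		
		System.out.println("PrimitiveObjectInspector: " + value + " or value1: " + value1);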
	}
	
	// Test: read primitive data through a Struct ObjectInspector instance
	// Note: here the Struct wraps primitive-typed fields
	public static void testStructOI() {
		// Part 1: one-time initialization of the inspectors
		// 1. Create the StructObjectInspector
		// 1.1. Collect the OI of each field (Java-based here; Hive Writable OIs would also work) in a list
		ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
		foi.add(PrimitiveObjectInspectorFactory.javaLongObjectInspector);
		foi.add(PrimitiveObjectInspectorFactory.javaDoubleObjectInspector);
		// 1.2. Collect the name of each field in a list
		ArrayList<String> fname = new ArrayList<String>();
		fname.add("count");
		fname.add("sum");
		// 1.3. Get the Struct OI instance (a StandardStructObjectInspector) from the factory
		// StandardStructObjectInspector implements the StructObjectInspector interface;
		// both are in the package org.apache.hadoop.hive.serde2.objectinspector
		StructObjectInspector soi = 
				ObjectInspectorFactory.getStandardStructObjectInspector(
						fname, foi);
		
		// 2.1. Look up each field by name; the StructOI returns its wrapper, a MyField instance
		// MyField implements the StructField interface
		// MyField: org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.MyField
		// StructField: org.apache.hadoop.hive.serde2.objectinspector.StructField
		StructField countField = soi.getStructFieldRef("count");
		StructField sumField = soi.getStructFieldRef("sum");
		// 2.2. From each StructField, get the primitive OI instance for that field
		LongObjectInspector countFieldOI = 
				(LongObjectInspector) countField.getFieldObjectInspector();
		DoubleObjectInspector sumFieldOI = 
				(DoubleObjectInspector) sumField.getFieldObjectInspector();
		
		// Part 2: for each incoming Object, extract the primitive values according to the OIs set up in Part 1
		// Note: an Object[] is itself an Object
		// 3. Use an Object[] holding two Java values as the input
		Object[] p = new Object[2];
		p[0] = Long.valueOf(1L);
		p[1] = Double.valueOf(2.0);
		
		// 4.1. Use the StructOI and the field references to pull each field's data out of the Object[]
		Object pCount = soi.getStructFieldData(p, countField);
		Object pSum = soi.getStructFieldData(p, sumField);
		// 4.2. Use each field's primitive OI to read the value
		long count = countFieldOI.get(pCount);
		double sum = sumFieldOI.get(pSum);
		
		System.out.println("struct {count:" + count + ", sum:" + sum + "}");
	}
	
	// Test: read primitive data through a List ObjectInspector instance
	// Note: here the List wraps primitive-typed elements
	public static void testListOI() {
		// Part 1: one-time initialization of the inspectors
		// 1.1. Create the ListObjectInspector with Java Long as the element type
		ListObjectInspector loi = 
				ObjectInspectorFactory.getStandardListObjectInspector(
						PrimitiveObjectInspectorFactory.javaLongObjectInspector);
		// 1.2. Get the primitive OI instance for the list elements
		// Note: the order can be reversed: get the primitive OI first, then build the ListOI from it
		PrimitiveObjectInspector poi = (PrimitiveObjectInspector)loi
				.getListElementObjectInspector();
		
		// Part 2: for each incoming Object, extract the primitive values according to the OIs set up in Part 1
		// Note: an Object[] is itself an Object
		// 2. Use an Object[] holding three Java values as the input
		Object[] p = new Object[3];
		p[0] = Long.valueOf(0L);
		p[1] = Long.valueOf(1L);
		p[2] = Long.valueOf(2L);
		
		// 3. Loop over the list through the ListOI and read each element
		Object o = null;
		long value;
		int listLen = loi.getListLength(p);
		System.out.print("list {");
		for (int i = 0; i < listLen; i++) {
			// 3.1. Get the i-th element through the ListOI
			o = loi.getListElement(p, i);
			// 3.2. Read the element with the primitive utility class, using the type described by poi
			value = PrimitiveObjectInspectorUtils.getLong(o, poi);
			// 3.2. Alternatively, read the value through the primitive OI instance itself
			//value = ((LongObjectInspector) poi).get(o);
			System.out.print(value);
			if (i < listLen - 1) System.out.print(",");
		}
		System.out.println("}");
        
	}
	
	// Test: read primitive data through a Map ObjectInspector instance
	// Note: here the Map wraps primitive-typed keys and values
	public static void testMapOI() {
		// Part 1: one-time initialization of the inspectors
		// 1.1. Get the primitive OI instances for the map's key and value
		LongObjectInspector keyLongOI = (LongObjectInspector)PrimitiveObjectInspectorFactory
				.getPrimitiveJavaObjectInspector(PrimitiveCategory.LONG);
		DoubleObjectInspector valueDoubleOI = (DoubleObjectInspector)PrimitiveObjectInspectorFactory
				.getPrimitiveJavaObjectInspector(PrimitiveCategory.DOUBLE);
		// 1.2. Create the MapObjectInspector
		MapObjectInspector moi = 
				ObjectInspectorFactory.getStandardMapObjectInspector(
						keyLongOI, valueDoubleOI);
		
		// Part 2: for each incoming Object, extract the primitive values according to the OIs set up in Part 1
		// 2. Use a Java Map with two entries as the input Object
		Map<Long, Double> pmap = new HashMap<Long, Double>();
		pmap.put(Long.valueOf(1L), Double.valueOf(1.1));
		pmap.put(Long.valueOf(2L), Double.valueOf(2.2));
		Object p = pmap;
		
		// 3. Loop over the entries through the MapOI and read each key and value
		Map<?, ?> map = moi.getMap(p);
		Iterator<?> it = map.keySet().iterator();
		Object oKey, oValue;
		long key; 
		double value;
		System.out.print("map {");
		while (it.hasNext()) {
			oKey = it.next();
			oValue = map.get(oKey);
			// Read the key and value through their primitive OIs
			key = keyLongOI.get(oKey);
			value = valueDoubleOI.get(oValue);
			System.out.print("\"" + key + "\":" + "\"" + value + "\" ");
		}
		System.out.print("}");
        
	}
	
}

Output:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
PrimitiveObjectInspector: 1.0 or value1: 1.0
struct {count:1, sum:2.0}
list {0,1,2}
map {"1":"1.1" "2":"2.2" }

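Note 1: When run locally, this program could only use Java-typed data (Writable types failed because many classes were missing). This is only a test case; in real development Writable types can be used.

Note 2: slf4j-api-1.7.26.jar must be imported locally; otherwise the program fails with java.lang.ClassNotFoundException: org.slf4j.LoggerFactory.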
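The testUnionOI() case is left unfinished in main() above. Conceptually, a UNION value holds exactly one of several alternative types, together with a tag that says which alternative is present, and a union inspector keeps one element inspector per alternative. The following is a dependency-free sketch of that idea with hypothetical classes (`TaggedValue` and `MiniUnionInspector` are invented for illustration); it is not Hive's union API, whose exact signatures should be checked against the Hive source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class UnionSketch {

    /** Hypothetical union value: a tag selecting the alternative, plus the payload. */
    static final class TaggedValue {
        final int tag;
        final Object payload;

        TaggedValue(int tag, Object payload) {
            this.tag = tag;
            this.payload = payload;
        }
    }

    /** Hypothetical union inspector: one reader per alternative, indexed by tag. */
    static final class MiniUnionInspector {
        private final List<Function<Object, String>> readers;

        MiniUnionInspector(List<Function<Object, String>> readers) {
            this.readers = readers;
        }

        /** Picks the reader that matches the value's tag and applies it to the payload. */
        String describe(Object o) {
            TaggedValue v = (TaggedValue) o;
            return v.tag + ":" + readers.get(v.tag).apply(v.payload);
        }
    }

    /** Builds an inspector for a union of two alternatives: tag 0 is a long, tag 1 is a string. */
    static MiniUnionInspector longOrString() {
        Function<Object, String> longReader = p -> "long " + (Long) p;
        Function<Object, String> stringReader = p -> "string " + (String) p;
        return new MiniUnionInspector(Arrays.asList(longReader, stringReader));
    }

    public static void main(String[] args) {
        MiniUnionInspector uoi = longOrString();
        System.out.println(uoi.describe(new TaggedValue(0, 42L))); // prints 0:long 42
        System.out.println(uoi.describe(new TaggedValue(1, "x"))); // prints 1:string x
    }
}
```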

 

 
