ORCFile
2019-08-27
背麻袋的袋鼠
I. Reading and writing code
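=========================== Write ============================
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

Configuration conf = new Configuration();
// Write one row index entry per 1000 rows instead of the default 10000
conf.set("hive.exec.orc.default.row.index.stride", "1000");
TypeDescription schema = TypeDescription.createStruct()
        .addField("int_value", TypeDescription.createInt())
        .addField("long_value", TypeDescription.createLong())
        .addField("double_value", TypeDescription.createDouble())
        .addField("float_value", TypeDescription.createFloat())
        .addField("boolean_value", TypeDescription.createBoolean())
        .addField("string_value", TypeDescription.createString());
Writer writer = OrcFile.createWriter(new Path("C:\\Users\\admin\\Desktop\\my-file.orc"),
        OrcFile.writerOptions(conf)
                .setSchema(schema));
VectorizedRowBatch batch = schema.createRowBatch();
// int, long and boolean columns use LongColumnVector; float and double use DoubleColumnVector
LongColumnVector intVector = (LongColumnVector) batch.cols[0];
LongColumnVector longVector = (LongColumnVector) batch.cols[1];
DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[2];
DoubleColumnVector floatColumnVector = (DoubleColumnVector) batch.cols[3];
LongColumnVector booleanVector = (LongColumnVector) batch.cols[4];
BytesColumnVector stringVector = (BytesColumnVector) batch.cols[5];
for (int r = 0; r < 100000; ++r) {
    int row = batch.size++;
    intVector.vector[row] = r;
    longVector.vector[row] = r;
    doubleVector.vector[row] = r;
    floatColumnVector.vector[row] = r;
    booleanVector.vector[row] = r < 50000 ? 1 : 0;
    stringVector.setVal(row, UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
    // Flush the batch to the writer once it is full
    if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
    }
}
// Flush the rows left in the last, partially filled batch
if (batch.size != 0) {
    writer.addRowBatch(batch);
    batch.reset();
}
writer.close();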
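The final `if (batch.size != 0)` check in the write loop matters because 100000 is not a multiple of the batch capacity. The sketch below replays only the counting logic of that loop, using the Java standard library and no ORC classes. The class and method names are hypothetical, and 1024 is assumed as the batch capacity because that is the value of VectorizedRowBatch.DEFAULT_SIZE.

```java
import java.util.ArrayList;
import java.util.List;

/** Stdlib-only sketch of the fill/flush loop; hypothetical names, no ORC classes. */
final class BatchFlushSketch {

    /** Returns the size of every batch that would be handed to the writer. */
    static List<Integer> flushedBatchSizes(int totalRows, int maxBatchSize) {
        List<Integer> flushed = new ArrayList<>();
        int size = 0;                       // plays the role of batch.size
        for (int r = 0; r < totalRows; r++) {
            size++;                         // one more row placed in the batch
            if (size == maxBatchSize) {     // batch is full: flush and reset
                flushed.add(size);
                size = 0;
            }
        }
        if (size != 0) {                    // trailing, partially filled batch
            flushed.add(size);
        }
        return flushed;
    }

    public static void main(String[] args) {
        // Assumption: capacity 1024, the value of VectorizedRowBatch.DEFAULT_SIZE
        List<Integer> sizes = flushedBatchSizes(100_000, 1024);
        System.out.println("batches written: " + sizes.size());
        System.out.println("rows in last batch: " + sizes.get(sizes.size() - 1));
    }
}
```

Under these assumptions the loop flushes 97 full batches and one final batch of 672 rows; without the trailing check those 672 rows would never reach the writer.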
============================ Read ==============================
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.DoubleColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

Configuration conf = new Configuration();
TypeDescription readSchema = TypeDescription.createStruct()
        .addField("int_value", TypeDescription.createInt())
        .addField("long_value", TypeDescription.createLong())
        .addField("double_value", TypeDescription.createDouble())
        .addField("float_value", TypeDescription.createFloat())
        .addField("boolean_value", TypeDescription.createBoolean())
        .addField("string_value", TypeDescription.createString());
Reader reader = OrcFile.createReader(new Path("C:\\Users\\admin\\Desktop\\my-file.orc"),
        OrcFile.readerOptions(conf));
// Read only the row groups that may satisfy the filter; a row group is 10000 rows by default
Reader.Options readerOptions = new Reader.Options(conf)
        .searchArgument(
                SearchArgumentFactory
                        .newBuilder()
                        .between("long_value", PredicateLeaf.Type.LONG, 0L, 10L)
                        // .equals("long_value",PredicateLeaf.Type.LONG,10000L)
                        .build(),
                new String[]{"int_value","long_value","double_value","float_value","boolean_value","string_value"}
        );
String s = readerOptions.toString();
System.out.println(s);
RecordReader rows = reader.rows(readerOptions.schema(readSchema));
VectorizedRowBatch batch = readSchema.createRowBatch();
while (rows.nextBatch(batch)) {
    LongColumnVector intVector = (LongColumnVector) batch.cols[0];
    LongColumnVector longVector = (LongColumnVector) batch.cols[1];
    DoubleColumnVector doubleVector = (DoubleColumnVector) batch.cols[2];
    DoubleColumnVector floatVector = (DoubleColumnVector) batch.cols[3];
    LongColumnVector booleanVector = (LongColumnVector) batch.cols[4];
    BytesColumnVector stringVector = (BytesColumnVector) batch.cols[5];
    for (int r = 0; r < batch.size; r++) {
        // When a vector's isRepeating flag is set, only element 0 is populated and it applies to every row
        int intValue = (int) intVector.vector[intVector.isRepeating ? 0 : r];
        long longValue = longVector.vector[longVector.isRepeating ? 0 : r];
        double doubleValue = doubleVector.vector[doubleVector.isRepeating ? 0 : r];
        float floatValue = (float) floatVector.vector[floatVector.isRepeating ? 0 : r];
        boolean boolValue = booleanVector.vector[booleanVector.isRepeating ? 0 : r] != 0;
        int i = stringVector.isRepeating ? 0 : r;
        String stringValue = new String(stringVector.vector[i], stringVector.start[i], stringVector.length[i], StandardCharsets.UTF_8);
        System.out.println(intValue + "," + longValue + ", " + doubleValue + ", " + floatValue + ", " + boolValue + ", " + stringValue);
    }
}
rows.close();
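// Print the total row count stored in the file (reader.rows() would open a second RecordReader that is never closed)
System.out.println(reader.getNumberOfRows());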
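II. Default parameter settings
| Parameter | Default | Description |
| --- | --- | --- |
| hive.exec.orc.default.stripe.size | 256 * 1024 * 1024 | Default stripe size in bytes (Hive 0.14 lowered this default to 64 * 1024 * 1024) |
| hive.exec.orc.default.block.size | 256 * 1024 * 1024 | Default file system block size for ORC files in bytes, available since Hive 0.14 |
| hive.exec.orc.dictionary.key.size.threshold | 0.8 | Threshold for dictionary encoding of String columns: if the ratio of distinct values to non-null rows exceeds it, dictionary encoding is turned off |
| hive.exec.orc.default.row.index.stride | 10000 | Number of rows per row group (index entry) within a stripe |
| hive.exec.orc.default.compress | ZLIB | Default compression codec for ORC files |
| hive.exec.orc.skip.corrupt.data | false | How corrupt data is handled: false throws an exception, true skips the record |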
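To make the hive.exec.orc.dictionary.key.size.threshold entry concrete, here is a minimal sketch of the decision it controls, assuming the ratio is computed as distinct values divided by non-null rows. It uses only the Java standard library, the class and method names are hypothetical, and it is not the ORC writer's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch of the dictionary-encoding threshold check. */
final class DictionaryThresholdSketch {

    /** Returns true when dictionary encoding would be kept for the column. */
    static boolean useDictionary(List<String> values, double threshold) {
        Set<String> distinct = new HashSet<>();
        long nonNull = 0;
        for (String v : values) {
            if (v != null) {        // nulls count toward neither side of the ratio
                nonNull++;
                distinct.add(v);
            }
        }
        // Assumption: an empty column has ratio 0 and keeps the dictionary
        double ratio = nonNull == 0 ? 0.0 : (double) distinct.size() / nonNull;
        return ratio <= threshold;
    }

    public static void main(String[] args) {
        List<String> uuids = new ArrayList<>();
        List<String> colors = new ArrayList<>();
        String[] palette = {"red", "green", "blue"};
        for (int i = 0; i < 1000; i++) {
            uuids.add(UUID.randomUUID().toString());   // every value is distinct
            colors.add(palette[i % palette.length]);   // only 3 distinct values
        }
        System.out.println("uuid column keeps dictionary:  " + useDictionary(uuids, 0.8));
        System.out.println("color column keeps dictionary: " + useDictionary(colors, 0.8));
    }
}
```

Under this model a column of random UUIDs, like string_value in the write example, has a ratio of 1.0 and falls back to direct encoding, while a low-cardinality column stays dictionary encoded.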
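III. Other notes
1. A filtered read does not return only the matching rows: it returns every row of each row group whose index statistics show that it may contain a match.
2. The row group size (hive.exec.orc.default.row.index.stride) defaults to 10000 rows; the minimum is 1000.
3. If a string column holds the same value in every row of the result, only the first row of each returned group carries the full value and the column is empty for the other rows. This matches the isRepeating flag of ColumnVector: when it is true, element 0 holds the value for all rows, so read index 0 instead of the row index.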
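Notes 1 and 2 above can be illustrated with a simplified model of row group pruning. The sketch assumes the reader keeps only a minimum and maximum per row group (one group per row index stride) and can skip whole groups but never individual rows. It uses only the Java standard library, its names are hypothetical, and it leaves out everything else a real ORC reader consults, such as stripe-level statistics.

```java
import java.util.stream.LongStream;

/** Hypothetical, simplified model of min/max pruning at row group granularity. */
final class RowGroupPruningSketch {

    /** Rows returned when only whole row groups can be skipped. */
    static long rowsReturned(long[] column, int stride, long lower, long upper) {
        long returned = 0;
        for (int start = 0; start < column.length; start += stride) {
            int end = Math.min(start + stride, column.length);
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (int i = start; i < end; i++) {   // per-group statistics
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            // Keep the group if its [min, max] range overlaps [lower, upper]
            if (max >= lower && min <= upper) {
                returned += end - start;
            }
        }
        return returned;
    }

    /** Rows that actually satisfy lower <= value <= upper. */
    static long rowsMatching(long[] column, long lower, long upper) {
        return LongStream.of(column).filter(v -> v >= lower && v <= upper).count();
    }

    public static void main(String[] args) {
        // Same data as long_value in the write example: 0, 1, ..., 99999
        long[] longValue = LongStream.range(0, 100_000).toArray();
        System.out.println("matching rows:          " + rowsMatching(longValue, 0, 10));
        System.out.println("returned, stride 1000:  " + rowsReturned(longValue, 1000, 0, 10));
        System.out.println("returned, stride 10000: " + rowsReturned(longValue, 10000, 0, 10));
    }
}
```

In this model the filter between 0 and 10 matches 11 rows, yet 1000 rows come back with a stride of 1000 and 10000 with the default stride, so the caller still has to apply the predicate to each returned row.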