HIVE-29598: Fix vectorized outer join wrong results due to stale scratch column values#6486
HIVE-29598: Fix vectorized outer join wrong results due to stale scratch column values#6486ryukobayashi wants to merge 1 commit into
Conversation
…tch column values
|
soumyakanti3578
left a comment
There was a problem hiding this comment.
Vectorization code can be tricky and brittle. To ensure there are no unintended consequences of this change, could you please add some tests? A minimal reproducer as a qtest is essential, and unit tests for the vector clearing operation across different ColumnVector types would also be valuable.
| private static void clearVectorValue(ColumnVector colVector, int index) { | ||
| if (colVector instanceof LongColumnVector) { | ||
| ((LongColumnVector) colVector).vector[index] = 0L; | ||
| } else if (colVector instanceof DoubleColumnVector) { | ||
| ((DoubleColumnVector) colVector).vector[index] = 0.0; | ||
| } else if (colVector instanceof BytesColumnVector) { | ||
| BytesColumnVector bcv = (BytesColumnVector) colVector; | ||
| bcv.vector[index] = null; | ||
| bcv.start[index] = 0; | ||
| bcv.length[index] = 0; | ||
| } else if (colVector instanceof TimestampColumnVector) { | ||
| ((TimestampColumnVector) colVector).setNullValue(index); | ||
| } else if (colVector instanceof IntervalDayTimeColumnVector) { | ||
| ((IntervalDayTimeColumnVector) colVector).setNullValue(index); |
There was a problem hiding this comment.
Can you please confirm that these are the only ColumnVectors that can appear in smallTableValueColumnMap and outerSmallTableKeyColumnMap?
Maybe we should do the same for all ColumnVector types, and it would be better to do this in the individual classes instead of handling it here.



What changes were proposed in this pull request?
In vectorized outer join,
generateOuterNulls()andgenerateOuterNullsRepeatedAll()setisNull[i] = trueon scratch columns but leavevector[i]untouched. Whenhive.vectorized.reuse.scratch.columns=true(the default), a scratch column slotfreed after an expression evaluation (e.g.
CastStringToLong) can be reused for the outer join's null-marking column. Afterreset()clearsisNull[], the expression overwritesvector[i]with a fresh value (e.g. 2025). Later,generateOuterNulls()sets
isNull[i] = truewithout clearingvector[i], leaving a stale non-zero value.Downstream operators such as
ColOrColreadvector[i]directly to distinguish "false" (== 0) from "null" (!= 0). The stale value causes null rows to be misinterpreted as "true", producing wrong OR/AND/CASE WHEN results.The fix adds
clearVectorValue(), called wheneverisNull[i]is set totruein the outer join null-marking paths, zeroingvector[i]for all supported column vector types (LongColumnVector,DoubleColumnVector,BytesColumnVector,TimestampColumnVector,IntervalDayTimeColumnVector).Why are the changes needed?
Without the fix, vectorized outer joins silently return wrong results when scratch column reuse is enabled (the default). The bug is non-obvious because it only triggers when a specific combination of conditions is met: a type-casting expression allocates a scratch column that is later reused for the outer join's null-marking column, and the join result is consumed by a boolean operator that reads the raw vector value for null discrimination. Users have no indication that results are wrong; workarounds require disabling vectorization entirely (
hive.vectorization.enabled=false) or disabling scratch column reuse (hive.vectorized.reuse.scratch.columns=false), both of which carry a significant performance cost.Does this PR introduce any user-facing change?
No
How was this patch tested?
The existing TestMapJoinOperator suite (17 tests) passes without regression. The bug can also be verified manually with the minimal SQL reproducer below; with the fix applied, the result matches the expected output (C 2026 new, D 2026 new) that was previously only obtainable by disabling vectorization.