
HIVE-29618: fix multiline string matches for vectorization too (follows up on HIVE-22008)#6494

Open
konstantinb wants to merge 1 commit into apache:master from konstantinb:HIVE-29618


@konstantinb konstantinb commented May 15, 2026

What changes were proposed in this pull request?

HIVE-29618: fix multiline string matches for vectorization too (follows up on HIVE-22008)

Add the Pattern.DOTALL flag to the vectorized LIKE operator's ComplexChecker regex compile (AbstractFilterStringColLikeStringScalar.java:422), restoring parity with the non-vectorized
UDFLike (UDFLike.java:194, fixed by HIVE-22008 in 2019). Extend udf_like.q with a regression test for the COMPLEX/regex path and run every statement once more with
hive.vectorized.execution.enabled=false to lock in cross-mode parity. Add a focused Java unit test in TestVectorStringExpressions.
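
As a rough illustration of the change described above, the sketch below translates a LIKE pattern to a regex and compiles it with or without Pattern.DOTALL. The class LikeRegexSketch and its methods likeToRegex and like are hypothetical names made up for this example; they are not the code in AbstractFilterStringColLikeStringScalar or UDFLike, and the escape handling is a simplifying assumption.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Hive's implementation.
public class LikeRegexSketch {

    // Translate a SQL LIKE pattern to a Java regex:
    // '_' becomes '.', '%' becomes '.*?', "\_" and "\%" become literals,
    // and every other character is quoted so it matches itself.
    static String likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < like.length(); i++) {
            char c = like.charAt(i);
            boolean escapesWildcard = c == '\\' && i + 1 < like.length()
                    && (like.charAt(i + 1) == '_' || like.charAt(i + 1) == '%');
            if (escapesWildcard) {
                sb.append(Pattern.quote(String.valueOf(like.charAt(i + 1))));
                i++; // skip the escaped wildcard
            } else if (c == '_') {
                sb.append('.');
            } else if (c == '%') {
                sb.append(".*?");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    // Evaluate "value LIKE likePattern"; the dotall switch models the
    // behavior before (false) and after (true) adding Pattern.DOTALL.
    static boolean like(String value, String likePattern, boolean dotall) {
        int flags = dotall ? Pattern.DOTALL : 0;
        return Pattern.compile(likeToRegex(likePattern), flags)
                .matcher(value)
                .matches();
    }

    public static void main(String[] args) {
        String value = "first\nsecond";
        System.out.println(like(value, "%s_cond", false)); // without DOTALL
        System.out.println(like(value, "%s_cond", true));  // with DOTALL
    }
}
```

Under these assumptions, the multi-line value matches only when the regex is compiled with DOTALL, which is the parity the PR restores for the vectorized path.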

Why are the changes needed?

HIVE-22008 fixed non-vectorized LIKE to match multi-line input but missed the parallel vectorized implementation. Patterns containing an unescaped _ route through ComplexChecker, which compiled
its regex without DOTALL; the resulting . (from _) and .*? (from %) cannot cross newlines, so vectorized LIKE drops rows whose values contain \n. The query output is silently wrong:
the same LIKE query returns different row counts in vectorized and non-vectorized mode.
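
The underlying java.util.regex behavior can be checked in isolation. The snippet below is a minimal sketch using only the standard Pattern API; the class name DotallNewlineDemo, the helper fullMatch, and the sample strings are assumptions chosen for illustration and do not come from the Hive test suite.

```java
import java.util.regex.Pattern;

// Illustrative only: shows how '.' and '.*?' treat '\n' with and without DOTALL.
public class DotallNewlineDemo {

    // True if the entire input matches the regex under the given flags.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "a.b" is the regex form of LIKE 'a_b'.
        System.out.println(fullMatch("a.b", 0, "a\nb"));              // false
        System.out.println(fullMatch("a.b", Pattern.DOTALL, "a\nb")); // true

        // ".*?b." is the regex form of LIKE '%b_'.
        System.out.println(fullMatch(".*?b.", 0, "line1\nab!"));              // false
        System.out.println(fullMatch(".*?b.", Pattern.DOTALL, "line1\nab!")); // true
    }
}
```

By default `.` excludes line terminators, so a whole-string match fails as soon as a wildcard has to span a newline; DOTALL removes that exclusion.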

Does this PR introduce any user-facing change?

Yes, as a bug fix. Vectorized LIKE patterns containing _ now match multi-line strings consistently with non-vectorized execution and with SQL semantics. Queries that previously dropped rows now
return them. No syntax, API, or config changes.

How was this patch tested?

  • New unit test TestVectorStringExpressions#testStringLikeComplexCheckerMultiLine: fails before the fix, passes after.
  • udf_like.q extended; the regenerated udf_like.q.out differs from the pre-fix snapshot by a single character (0 → 2) at the new test's vectorized-mode result line; every other expected
    output is unchanged.
  • mvn checkstyle:check -pl ql is clean.
  • Manually verified the diff produced no other behavior changes in the udf_like golden.

POSTHOOK: type: QUERY
POSTHOOK: Input: default@splitlinesunderscore
#### A masked pattern was here ####
2

the value was 0 before the code changes of this PR
