HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge… by tanishq-chugh · Pull Request #6443 · apache/hive

tanishq-chugh · 2026-04-18T13:13:18Z

…tColumnTypes

What changes were proposed in this pull request?

Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes

Why are the changes needed?

Performance Improvement

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + CI

Aggarwal-Raghav · 2026-04-18T17:43:58Z

+        if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) {
+          logTypeWarning(colName, type);
+        } else {
+          nonPrimColNames.add(colName);


The variable should be named primColNames rather than nonPrimColNames, since primitive types take the else branch.

@Aggarwal-Raghav My bad. I validated the column types/names returned for primitive types but used the wrong variable name. Updated in commit 4a6804d.
Thanks for pointing this out!
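
To illustrate the naming point: non-primitive columns take the warning branch and everything else is collected, so the collected list holds the primitive columns. A minimal self-contained sketch follows. The `Category` enum only mirrors the constant names of Hive's `ObjectInspector.Category`, and `categoryOf` is a hypothetical stand-in that inspects the type-string prefix; it is not how the patch resolves `TypeInfo`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class PrimitiveColumnFilterSketch {
  // Stand-in mirroring the constants of Hive's ObjectInspector.Category (not the real enum).
  enum Category { PRIMITIVE, LIST, MAP, STRUCT, UNION }

  // Hypothetical helper: classify a Hive type string by its prefix.
  static Category categoryOf(String type) {
    String t = type.toLowerCase(Locale.ROOT);
    if (t.startsWith("array<")) return Category.LIST;
    if (t.startsWith("map<")) return Category.MAP;
    if (t.startsWith("struct<")) return Category.STRUCT;
    if (t.startsWith("uniontype<")) return Category.UNION;
    return Category.PRIMITIVE;
  }

  // Returns the names of primitive columns; non-primitive ones are only reported.
  static List<String> primitiveColumnNames(List<String> names, List<String> types) {
    List<String> primColNames = new ArrayList<>();
    for (int i = 0; i < names.size(); i++) {
      if (categoryOf(types.get(i)) != Category.PRIMITIVE) {
        System.out.println("Skipping non-primitive column " + names.get(i));
      } else {
        primColNames.add(names.get(i)); // the else branch is the primitive case
      }
    }
    return primColNames;
  }

  public static void main(String[] args) {
    // prints "Skipping non-primitive column tags" and then [id, name]
    System.out.println(primitiveColumnNames(
        Arrays.asList("id", "tags", "name"),
        Arrays.asList("int", "array<string>", "string")));
  }
}
```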

thomasrebele · 2026-04-21T12:09:27Z

-          } else {
-            colTypes.add(type);
-          }
+    Map<String, String> colTypeMap = new HashMap<>();


Thanks for the PR! When I created HIVE-29551, I had in mind doing it without a HashMap if possible. There are two kinds of usage, depending on where the column names come from:

ColumnStatsSemanticAnalyzer#getColumnName

Utilities.getColumnNamesFromFieldSchema

The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.

The HashMap is only needed when the ASTNode has 3 children.
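
For the explicit-column case, the difference between the two lookup shapes can be sketched as below. This is a rough, self-contained illustration rather than the code in the patch: `Col` is a hypothetical stand-in for `FieldSchema`, and case-insensitive name matching is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ColumnTypeLookupSketch {
  /** Hypothetical stand-in for FieldSchema: just a column name and a type string. */
  static final class Col {
    final String name;
    final String type;
    Col(String name, String type) { this.name = name; this.type = type; }
  }

  // Quadratic shape: every requested name scans the table columns, O(N*M).
  static List<String> typesByScan(List<Col> tableCols, List<String> requested) {
    List<String> types = new ArrayList<>();
    for (String name : requested) {
      for (Col col : tableCols) {
        if (col.name.equalsIgnoreCase(name)) {
          types.add(col.type);
          break;
        }
      }
    }
    return types;
  }

  // Linear shape: index the table columns once, then one hash lookup per name, O(N+M) expected.
  static List<String> typesByMap(List<Col> tableCols, List<String> requested) {
    Map<String, String> typeByName = new HashMap<>();
    for (Col col : tableCols) {
      // putIfAbsent keeps the first match, like the break in the scan above
      typeByName.putIfAbsent(col.name.toLowerCase(Locale.ROOT), col.type);
    }
    List<String> types = new ArrayList<>();
    for (String name : requested) {
      String type = typeByName.get(name.toLowerCase(Locale.ROOT));
      if (type != null) {
        types.add(type);
      }
    }
    return types;
  }

  public static void main(String[] args) {
    List<Col> cols = Arrays.asList(new Col("id", "int"), new Col("name", "string"));
    List<String> requested = Arrays.asList("NAME", "id");
    System.out.println(typesByScan(cols, requested)); // prints [string, int]
    System.out.println(typesByMap(cols, requested));  // prints [string, int]
  }
}
```

When the names come from the table's own column list instead, no index is needed at all, since each item already carries its type.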

thomasrebele

Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.

thomasrebele · 2026-04-28T10:14:24Z

    return rwt;
  }

+  private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) {


Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?

Made this change in commit - 84d81f9

thomasrebele · 2026-04-28T10:16:38Z

+    return new StatsEligibleColumns(colNames, colTypes);
  }

  private List<String> getColumnName(ASTNode tree) throws SemanticException {


I would suggest renaming the function, maybe to "getExplicitColumnNames", though there may be a better name.

Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9

thomasrebele · 2026-04-28T10:17:25Z

+    colNames.clear();
+    colNames.addAll(primColNames);


Modifying the argument can be avoided when implementing my other comments.

Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9

thomasrebele · 2026-04-28T10:26:43Z

  }

-  protected static List<String> getColumnTypes(Table tbl, List<String> colNames) {
+  protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) {


I recommend refactoring getColumnTypesByName to return List<FieldSchema>.

Made this change in commit - 84d81f9

thomasrebele · 2026-04-28T10:44:30Z

+        colNames = statsCols.columnNames();
+      } else {
+        colNames = getColumnName(ast);
+      }


The handling of the AST should stay in one place to avoid code duplication here and in #rewriteAST. Maybe a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:

1. Collect the column names as strings using the original method.

2. Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and remove the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze).

3. Collect the columns as List<FieldSchema>.

The caller extracts the names (with org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).

This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that steps 1 and 3 collect the columns only once.
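
The caller-side extraction mentioned above could look roughly like this. It is only a sketch: `Field` is a hypothetical stand-in exposing the same `getName()`/`getType()` getters as `FieldSchema`, and `columnTypes` is an invented helper name, since the comment notes that no existing type helper was found in Utilities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldSchemaProjectionSketch {
  /** Hypothetical stand-in for FieldSchema with the two getters used here. */
  static final class Field {
    private final String name;
    private final String type;
    Field(String name, String type) { this.name = name; this.type = type; }
    String getName() { return name; }
    String getType() { return type; }
  }

  // Plays the role of Utilities.getColumnNamesFromFieldSchema in this sketch.
  static List<String> columnNames(List<Field> fields) {
    List<String> names = new ArrayList<>(fields.size());
    for (Field f : fields) {
      names.add(f.getName());
    }
    return names;
  }

  // Invented counterpart for the types; the list order matches columnNames.
  static List<String> columnTypes(List<Field> fields) {
    List<String> types = new ArrayList<>(fields.size());
    for (Field f : fields) {
      types.add(f.getType());
    }
    return types;
  }

  public static void main(String[] args) {
    List<Field> cols = Arrays.asList(new Field("id", "int"), new Field("name", "string"));
    System.out.println(columnNames(cols)); // prints [id, name]
    System.out.println(columnTypes(cols)); // prints [int, string]
  }
}
```

Because both lists are derived from the same `List<Field>`, names and types stay aligned by index without mutating any argument.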

Thanks for pointing this out, @thomasrebele!
Yes, this definitely makes more sense and helps keep the code clean. I have made all these changes in commit 84d81f9.

thomasrebele · 2026-04-28T10:55:41Z

-    default:
+    if (tree.getChildCount() != 3) {
      throw new SemanticException("Internal error. Expected number of children of ASTNode to be"
          + " either 2 or 3. Found : " + tree.getChildCount());


If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.

Updated the exception message in commit - 84d81f9

abstractdog

LGTM
in the current form, this is a good optimization and refactoring of the same area, thanks!

Aggarwal-Raghav · 2026-05-12T08:37:20Z

+      columnNames = getExplicitColumnNamesFromAst(ast);
+    }
+
+    checkForPartitionColumns(columnNames, Utilities.getColumnNamesFromFieldSchema(tbl.getPartitionKeys()));


@tanishq-chugh, can you please check checkForPartitionColumns? It also has a nested for loop, and validateSpecifiedColumnNames uses a for loop + contains(), which is also O(N*M). Both are used in getColumnsFromAst().

I haven't gone through the full PR yet; I just wanted to highlight this.

Made this change in commit: 68446bd
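
The O(N*M) pattern described above, and a set-based alternative, can be sketched as follows. The method names are hypothetical and this is not the code from the commit; case-insensitive matching via lower-casing is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PartitionColumnCheckSketch {
  // Nested-loop shape: each requested name is compared with each partition key, O(N*M).
  static boolean mentionsPartitionColumnNested(List<String> requested, List<String> partitionKeys) {
    for (String name : requested) {
      for (String key : partitionKeys) {
        if (name.equalsIgnoreCase(key)) {
          return true;
        }
      }
    }
    return false;
  }

  // Set-based shape: build a HashSet once, then one expected-O(1) lookup per name, O(N+M).
  static boolean mentionsPartitionColumnHashed(List<String> requested, List<String> partitionKeys) {
    Set<String> keys = new HashSet<>();
    for (String key : partitionKeys) {
      keys.add(key.toLowerCase(Locale.ROOT));
    }
    for (String name : requested) {
      if (keys.contains(name.toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> partitionKeys = Arrays.asList("ds", "hr");
    System.out.println(mentionsPartitionColumnNested(Arrays.asList("id", "DS"), partitionKeys)); // prints true
    System.out.println(mentionsPartitionColumnHashed(Arrays.asList("id", "DS"), partitionKeys)); // prints true
    System.out.println(mentionsPartitionColumnHashed(Arrays.asList("id", "name"), partitionKeys)); // prints false
  }
}
```

The same idea applies to a loop that calls `List.contains()`: copying the list into a `HashSet` first removes the inner linear scan.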

abstractdog · 2026-05-14T05:37:53Z

Double-checking here: I'd like to know whether anyone has any further concerns or comments. My +1 still stands; I'll wait another 24 hours unless we receive confirmation sooner.

Aggarwal-Raghav · 2026-05-14T07:24:35Z

@tanishq-chugh / @abstractdog , I have a question.

In validateSpecifiedColumnNames we check whether the columns exist: 1 HashMap
In checkForPartitionColumns we check for partition columns: 1 HashSet
In getFieldSchemasByColName we get the types of the validated columns: 1 HashMap

Can't we do both 1 and 2 inside 3 while maintaining a single data structure? I think it should be possible.
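
One possible way to fold the three checks into a single map is sketched below. This is an assumption about how such a merge could look, not the code that was eventually committed: `Col`, its `partition` flag, `resolve`, and the use of `IllegalArgumentException` are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SingleMapColumnResolutionSketch {
  /** Hypothetical stand-in for FieldSchema plus a flag marking partition columns. */
  static final class Col {
    final String name;
    final String type;
    final boolean partition;
    Col(String name, String type, boolean partition) {
      this.name = name;
      this.type = type;
      this.partition = partition;
    }
  }

  private static String key(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  // One map answers all three questions: does the column exist, is it a partition
  // column, and what is its type.
  static List<Col> resolve(List<Col> tableCols, List<Col> partitionCols, List<String> requested) {
    Map<String, Col> byName = new HashMap<>();
    for (Col c : tableCols) {
      byName.putIfAbsent(key(c.name), c);
    }
    for (Col p : partitionCols) {
      byName.put(key(p.name), p);
    }
    List<Col> resolved = new ArrayList<>(requested.size());
    for (String name : requested) {
      Col c = byName.get(key(name));
      if (c == null) {
        throw new IllegalArgumentException("Unknown column: " + name);
      }
      if (c.partition) {
        throw new IllegalArgumentException("Partition column not allowed: " + name);
      }
      resolved.add(c);
    }
    return resolved;
  }

  public static void main(String[] args) {
    List<Col> tableCols = Arrays.asList(new Col("id", "int", false), new Col("name", "string", false));
    List<Col> partitionCols = Arrays.asList(new Col("ds", "string", true));
    for (Col c : resolve(tableCols, partitionCols, Arrays.asList("name", "ID"))) {
      System.out.println(c.name + " : " + c.type);
    }
  }
}
```

The trade-off is that error reporting for the two failure modes now lives in one loop, which may or may not be desirable for readability.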

The optimization + refactoring in this patch is good.
Just thinking in terms of math: ColumnStatsSemanticAnalyzer runs in the query compilation phase, so if my competitive coding concepts are correct:

For 1000 columns, O(N^2) => 1 million, i.e. 10^6 operations, which a modern computer can do in about 1 second.

For more than 3k columns or so, the real benefit of this optimization will kick in, I guess.
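
The arithmetic behind this estimate can be checked with a small sketch. It counts worst-case operations only, assuming every table column is requested; it says nothing about wall-clock time, and the method names are invented for illustration.

```java
class LookupCostSketch {
  // Worst case for the nested scan: each of m requested names is compared with all n columns.
  static long nestedScanComparisons(int n, int m) {
    return (long) n * m;
  }

  // Map-based lookup: n insertions to build the index plus m lookups.
  static long mapBasedOperations(int n, int m) {
    return (long) n + m;
  }

  public static void main(String[] args) {
    for (int n : new int[] {1000, 3000}) {
      System.out.println(n + " columns: nested=" + nestedScanComparisons(n, n)
          + ", map=" + mapBasedOperations(n, n));
    }
    // prints:
    // 1000 columns: nested=1000000, map=2000
    // 3000 columns: nested=9000000, map=6000
  }
}
```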

sonarqubecloud · 2026-05-15T02:01:56Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.3% Duplication on New Code

See analysis details on SonarQube Cloud

tanishq-chugh · 2026-05-15T10:03:47Z

Hi @Aggarwal-Raghav
Thanks for pointing it out & yes, this is definitely better for optimisation. I have refactored the same in commit 37d1c22

CC: @abstractdog

asf-ci-hive added tests pending tests passed and removed tests pending labels Apr 18, 2026

Aggarwal-Raghav reviewed Apr 18, 2026

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 19, 2026

thomasrebele reviewed Apr 21, 2026

tanishq-chugh force-pushed the HIVE-29551 branch from 4a6804d to 85c0ebe Compare April 26, 2026 19:34

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 26, 2026

thomasrebele reviewed Apr 28, 2026

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending tests unstable labels Apr 28, 2026

Refactor genRewrittenQuery to use FieldSchemas

c030988

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels May 12, 2026

abstractdog approved these changes May 12, 2026

Aggarwal-Raghav reviewed May 12, 2026

Refactor validateSpecifiedColumnNames & checkForPartitionColumns

68446bd

asf-ci-hive added tests pending tests unstable tests passed and removed tests passed tests pending tests unstable labels May 12, 2026

asf-ci-hive added tests pending and removed tests passed labels May 14, 2026

Merge checkForPartitionColumns & validateSpecifiedColumnNames in getF…

37d1c22

…ieldSchemasByColName

tanishq-chugh force-pushed the HIVE-29551 branch from b1c2826 to 37d1c22 Compare May 14, 2026 22:40

asf-ci-hive added tests passed tests pending and removed tests pending tests passed labels May 14, 2026

asf-ci-hive added tests passed and removed tests pending labels May 15, 2026

5 participants
