HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443

Open
tanishq-chugh wants to merge 16 commits into apache:master from tanishq-chugh:HIVE-29551

Conversation

@tanishq-chugh
Contributor

…tColumnTypes

What changes were proposed in this pull request?

Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes

Why are the changes needed?

Performance Improvement

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + CI

if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) {
logTypeWarning(colName, type);
} else {
nonPrimColNames.add(colName);
Contributor

The variable name should be primColNames instead of nonPrimColNames, since primitive types enter the else branch.

Contributor Author

@Aggarwal-Raghav My bad, I validated the columnTypes/names being returned for primitive types and used the wrong variable name. Updated in commit 4a6804d.
Thanks for pointing this out!

} else {
colTypes.add(type);
}
Map<String, String> colTypeMap = new HashMap<>();
Contributor

Thanks for the PR! When I created HIVE-29551, I had in mind doing it without a HashMap if possible. There are two types of usage, depending on where the column names come from:

  • ColumnStatsSemanticAnalyzer#getColumnName
  • Utilities.getColumnNamesFromFieldSchema

The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.

The HashMap is only needed when the ASTNode has 3 children.
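
A minimal sketch of the two usages described above. The nested `Field` record is a hypothetical stand-in for Hive's `FieldSchema`, and the class and method names are illustrative rather than taken from the patch. When the names come from the schema list itself, the types can be read in the same pass; a map is only built when the names were listed explicitly.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColumnTypeLookup {
  // Hypothetical stand-in for Hive's FieldSchema (name and type only).
  record Field(String name, String type) {}

  // Case 1: the names come from the schema list, so the types are read
  // in the same order and no map is needed.
  static List<String> typesFromSchema(List<Field> schema) {
    List<String> types = new ArrayList<>(schema.size());
    for (Field f : schema) {
      types.add(f.type());
    }
    return types;
  }

  // Case 2: the names were given explicitly. One pass builds the map and
  // one lookup per name follows, so the cost is O(N + M) instead of O(N * M).
  static List<String> typesByName(List<Field> schema, List<String> names) {
    Map<String, String> typeByName = new HashMap<>();
    for (Field f : schema) {
      typeByName.put(f.name(), f.type());
    }
    List<String> types = new ArrayList<>(names.size());
    for (String name : names) {
      // Yields null for an unknown name; this sketch assumes names were validated earlier.
      types.add(typeByName.get(name));
    }
    return types;
  }
}
```

The sketch assumes column names are already normalized to a single case before the lookup.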

Contributor

@thomasrebele left a comment

Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.

return rwt;
}

private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) {
Contributor

Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?

Contributor Author

Made this change in commit - 84d81f9

return new StatsEligibleColumns(colNames, colTypes);
}

private List<String> getColumnName(ASTNode tree) throws SemanticException {
Contributor

I would suggest renaming the function, maybe to "getExplicitColumnNames", though there may be a better name.

Contributor Author

Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9

Comment on lines +245 to +246
colNames.clear();
colNames.addAll(primColNames);
Contributor

Modifying the argument can be avoided when implementing my other comments.

Contributor Author

Yes, the code has been updated in commit 84d81f9 so that this argument is no longer modified.

}

- protected static List<String> getColumnTypes(Table tbl, List<String> colNames) {
+ protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) {
Contributor

I recommend refactoring getColumnTypesByName to return List<FieldSchema>.

Contributor Author

Made this change in commit - 84d81f9

colNames = statsCols.columnNames();
} else {
colNames = getColumnName(ast);
}
Contributor

The handling of the AST should stay in one place to avoid code duplication here and in #rewriteAST. Maybe add a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:

  1. Collect the column names as strings using the original method
  2. Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and remove the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)
  3. Collect the columns as List<FieldSchema>
  4. The caller extracts the names (with org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).

This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if it does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that steps 1 and 3 only collect the columns once.
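
The steps above could be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code from the patch: `Field` stands in for `FieldSchema`, `IllegalArgumentException` stands in for `SemanticException`, `getColumns` takes plain lists instead of an `ASTNode` and a `Table`, and the partition-column check is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ColumnResolution {
  // Hypothetical stand-in for Hive's FieldSchema.
  record Field(String name, String type) {}

  // Steps 1-3 in one place: take the names, verify that each exists,
  // and return the matching fields in the requested order.
  static List<Field> getColumns(List<String> names, List<Field> tableCols) {
    Map<String, Field> byName = new HashMap<>();
    for (Field f : tableCols) {
      byName.put(f.name(), f);
    }
    List<Field> columns = new ArrayList<>(names.size());
    for (String name : names) {
      Field f = byName.get(name);
      if (f == null) {
        // Stand-in for the SemanticException thrown by the real validation.
        throw new IllegalArgumentException("Unknown column: " + name);
      }
      columns.add(f);
    }
    return columns;
  }

  // Step 4: the caller derives names and types from the same list,
  // so the two lists stay aligned without later modification.
  static List<String> namesOf(List<Field> columns) {
    return columns.stream().map(Field::name).toList();
  }

  static List<String> typesOf(List<Field> columns) {
    return columns.stream().map(Field::type).toList();
  }
}
```

Because names and types both come from one `List<Field>`, no caller needs to clear and refill a name list after filtering.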

Contributor Author

Thanks for pointing this out, @thomasrebele!
Yes, this definitely makes more sense and helps keep the code clean. I have made all these changes in commit 84d81f9.

default:
if (tree.getChildCount() != 3) {
throw new SemanticException("Internal error. Expected number of children of ASTNode to be"
+ " either 2 or 3. Found : " + tree.getChildCount());
Contributor

If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.

Contributor Author

Updated the exception message in commit - 84d81f9

Contributor

@abstractdog left a comment

LGTM
In its current form, this is a good optimization and refactoring of the same area, thanks!

columnNames = getExplicitColumnNamesFromAst(ast);
}

checkForPartitionColumns(columnNames, Utilities.getColumnNamesFromFieldSchema(tbl.getPartitionKeys()));
Contributor

@tanishq-chugh, can you please check checkForPartitionColumns? It also has a nested for loop, and validateSpecifiedColumnNames uses a for loop + contains(), which is also O(N*M). Both are used in getColumnsFromAst().

I haven't gone through the full PR yet, just wanted to highlight this.
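
A small sketch of the pattern being pointed out, with hypothetical method names and plain string lists rather than the actual Hive signatures. The first method compares every requested name with every partition column; the second builds a `HashSet` once and then does one lookup per requested name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PartitionColumnCheck {
  // Nested loops: every requested name is compared with every
  // partition column, O(N * M) string comparisons.
  static boolean hasPartitionColumnNested(List<String> requested, List<String> partitionCols) {
    for (String name : requested) {
      for (String part : partitionCols) {
        if (name.equals(part)) {
          return true;
        }
      }
    }
    return false;
  }

  // Set-based: build the set once (O(M)), then one expected O(1)
  // lookup per requested name (O(N)).
  static boolean hasPartitionColumnHashed(List<String> requested, List<String> partitionCols) {
    Set<String> partitions = new HashSet<>(partitionCols);
    for (String name : requested) {
      if (partitions.contains(name)) {
        return true;
      }
    }
    return false;
  }
}
```

The same reasoning applies to a loop that calls `List.contains()` on each iteration, since `contains()` on a list is itself a linear scan. The sketch assumes names are compared case-sensitively after normalization; a case-insensitive check would need the set keys normalized first.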

Contributor Author

Made this change in commit: 68446bd

@abstractdog
Contributor

Double-checking here: I'd like to know whether anyone has any further concerns or comments. My +1 still stands; I'll wait another 24 hours unless we receive confirmation sooner.

@Aggarwal-Raghav
Contributor

@tanishq-chugh / @abstractdog, I have a question.

  1. In validateSpecifiedColumnNames we check whether the columns exist (1 HashMap)
  2. In checkForPartitionColumns we check for partition columns (1 HashSet)
  3. In getFieldSchemasByColName we get the types of the columns validated above (1 HashMap)

Can't we do both 1 and 2 inside 3 while maintaining a single data structure? I think it should be possible.

The optimization + refactoring in this patch is good.
Just thinking in terms of math: ColumnStatsSemanticAnalyzer runs in the query compilation phase, so if my competitive coding concepts are correct:

For 1000 columns, O(N^2) => 1 million, i.e. 10^6 operations, which a modern computer does in about 1 sec.

For more than 3k columns or so, the real benefit of this optimization will kick in, I guess.
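
As a rough worked count, assuming the quadratic version compares each of the N requested names against up to N table columns, and a map-based version does one insertion per table column plus one lookup per requested name (both are assumptions about the shape of the code, not measurements):

```latex
\begin{aligned}
N = 1000:&\quad N^2 = 10^{6} \ \text{comparisons} &&\text{vs.}\quad 2N = 2 \times 10^{3} \ \text{map operations} \\
N = 3000:&\quad N^2 = 9 \times 10^{6} \ \text{comparisons} &&\text{vs.}\quad 2N = 6 \times 10^{3} \ \text{map operations}
\end{aligned}
```

These are operation counts, not timings; the actual cost also depends on string lengths and hashing overhead.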

@tanishq-chugh
Contributor Author

tanishq-chugh commented May 15, 2026

Hi @Aggarwal-Raghav,
Thanks for pointing this out. Yes, this is definitely better for optimisation. I have refactored accordingly in commit 37d1c22.

CC: @abstractdog
