Partial Projection Support

In Gluten, there is still a gap in supporting all Spark expressions natively (e.g., some JSON functions or Java UDFs). In this case, Gluten will choose the JVM code path to run the expressions, which can introduce performance regressions.

Partial projections, which allow Gluten to minimal data copy between JVM and C++, were added to avoid these performance regressions.

Detailed Implementations

Adding Partial Projection for UDF

For example, with the expression hash(udf(col0)), col1, col2, col3, col4, partial projection allows us to convert only col0 to row or column to Arrow as input, and convert udf(col0) as an alias partialProject1_. Then, ProjectExecTransformer will handle hash(partialProject1_), col1, col2, col3, col4, partialProject1_. This feature saves the cost of converting the columnar format to row format and vice-versa.

Adding Partial Projection for Unsupported Expressions

The partial projection feature can also benefit from expressions that are not natively supported. For example, substr(from_json(col_a)). Since from_json is not fully supported, Gluten may use the JVM code path. Instead of projecting the whole expression, partial projection will attempt to project from_json() and perform a native projection of substr().

In the case of blacklisted expressions defined in spark.gluten.expression.blacklist, this feature is also beneficial.

Limitations

This feature is in the preliminary stages of development and will be improved in future updates.


Back to top

Copyright © 2024 The Apache Software Foundation, Licensed under the Apache License, Version 2.0. Apache Gluten, Gluten, Apache, the Apache feather logo, and the Apache Gluten project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other marks mentioned may be trademarks or registered trademarks of their respective owners.

Apache Gluten is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Privacy Policy

This site uses Just the Docs, a documentation theme for Jekyll.