Verify unshimmed classes are bitwise-identical #3645

Merged
40 changes: 40 additions & 0 deletions dist/scripts/binary-dedupe.sh
@@ -75,3 +75,43 @@ mv "$SPARK3XX_COMMON_DIR" $PARALLEL_WORLDS_DIR/
# spark30x-common
# spark31x-common
# spark32x-common

# Verify that all class files in the conventional jar location are bitwise
# identical regardless of the Spark-version-specific jar.
#
# At this point the duplicate classes have not been removed from version-specific jar
# locations such as parallel-world/spark312.
# For each unshimmed class file look for all of its copies inside /spark3* and
# count the number of distinct checksums. There are two representative cases:
# 1) The class is contributed to the unshimmed location via the unshimmed-from-each-spark3xx list. These are classes
# carrying the shim classifier in their package name such as
# com.nvidia.spark.rapids.spark312.RapidsShuffleManager. They are unique by construction,
# and will have zero copies in any non-spark312 shims. Although such classes are currently excluded from
# being copied to the /spark312 Parallel World we keep the algorithm below general without assuming this.
#
# 2) The class is contributed to the unshimmed location via unshimmed-common. These are classes that
# have the same package and class name across all parallel worlds.
#
# So if the number of distinct class files per class in the unshimmed location is < 2, the jar
# content is as expected.
#
# If we find an unshimmed class file with more than one distinct copy, we fail the build and the code
# must be refactored until bitwise-identity of each unshimmed class is restored.

# Determine the list of unshimmed class files
UNSHIMMED_LIST_TXT=unshimmed-result.txt
find . -name '*.class' -not -path './'$PARALLEL_WORLDS_DIR/'spark*' | \
cut -d/ -f 3- | sort > $UNSHIMMED_LIST_TXT

for classFile in $(< $UNSHIMMED_LIST_TXT); do
  DISTINCT_COPIES=$(find . -path "./*/$classFile" -exec md5sum {} + |
    cut -d' ' -f 1 | sort -u | wc -l)
  ((DISTINCT_COPIES == 1)) || {
    echo >&2 "$classFile is not bitwise-identical, found $DISTINCT_COPIES distinct copies";
    exit 2;
  }
done

# Remove unshimmed classes from parallel worlds
xargs --arg-file="$UNSHIMMED_LIST_TXT" -P 4 -I% \
  find . -path "./$PARALLEL_WORLDS_DIR/spark*/%" -exec rm {} +
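
To try the distinct-checksum check outside the build, here is a minimal sketch on a throwaway directory. The helper name `count_distinct_copies`, the directory names, and the `com/example` class files are hypothetical and not part of binary-dedupe.sh. Only the `find | md5sum | cut | sort -u | wc -l` pipeline mirrors the diff above, and it assumes GNU coreutils (`md5sum`) is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of binary-dedupe.sh: prints the number of
# distinct md5 checksums among all copies of a class file found under any
# subdirectory of the given root.
count_distinct_copies() {
  local root=$1 classFile=$2
  (
    cd "$root" || exit 1
    find . -path "./*/$classFile" -exec md5sum {} + |
      cut -d' ' -f 1 | sort -u | wc -l
  )
}

# Toy layout with made-up names: one unshimmed copy plus two parallel worlds.
demo=$(mktemp -d)
mkdir -p "$demo"/{unshimmed,spark311,spark312}/com/example
printf 'same' > "$demo/unshimmed/com/example/A.class"
printf 'same' > "$demo/spark311/com/example/A.class"
printf 'same' > "$demo/spark312/com/example/A.class"
printf 'one'  > "$demo/spark311/com/example/B.class"
printf 'two'  > "$demo/spark312/com/example/B.class"

# A.class has identical bytes everywhere, so it has 1 distinct checksum.
count_distinct_copies "$demo" com/example/A.class
# B.class differs between the two worlds, so it has 2 distinct checksums.
count_distinct_copies "$demo" com/example/B.class
```

In the script above, a count other than 1 is what triggers the `exit 2` branch; here the second call only prints the count so that both outcomes can be inspected side by side.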