Implement a simplified version of from_json (NVIDIA#844)
* Add `MapUtils.java`

* Add clang-format style

* Fix comment

* Add empty files

* Fix compile issue and update clang-format

* Add Java test

* Concatenate the input json strings

* Misc

* Misc

* Print debug

* Update Java test

* Add more tests

* Implement several more computations

* Add comments

* Implement node-to-token-index map

* Compute node range from node indices, not token indices

* Extract node ranges for key-value of non-nested types

* Add more pairs to node section

* Get node ranges for nested nodes

* Extract json key-value pairs

* Extract parent node ids of keys

* Compute offsets for the output lists

* Fix offsets computation

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Print debug for the output

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* More efficient substring operation

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Update Java test

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Remove parameter

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Rewrite docs

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Rewrite for easier benchmark

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Extract out functions

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Refactor and cleanup

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Handle empty and null input rows

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Update Java test

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Cleanup headers

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Implement UTF-8 support

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Add Java test

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Fix error

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Move header into .cu file

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Update copyright headers

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Update function name

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Add `assert`

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Remove wrong comment

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Extract debug code into a separate header

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Simplify `output_size` computation

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Fix typo

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Cleanup unused variable

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Fix a bug

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Rename variable

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Print debug input when error

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Change the error message

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Optimize error report

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

* Change comment

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>

Signed-off-by: Nghia Truong <nghiatruong.vn@gmail.com>
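The steps listed above (extracting JSON key-value pairs, computing offsets for the output lists, handling empty and null input rows) can be illustrated with a small CPU-side sketch. This is not the commit's GPU implementation: every name below (`json_row`, `raw_map_column`, `skip_string`, `extract_pairs`, `build_column`) is hypothetical, the input is assumed to be well-formed JSON objects, and the choice to strip quotes from string values while keeping nested values as raw text is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical CPU-side sketch; none of these names come from the commit.
using kv_pair = std::pair<std::string, std::string>;

struct json_row {
  std::string text;
  bool is_valid;  // false models a null input row
};

struct raw_map_column {
  std::vector<int> offsets;    // pairs of row i live in [offsets[i], offsets[i + 1])
  std::vector<kv_pair> pairs;  // key/value pairs of all rows, flattened
};

// Given s[i] == '"', return the index one past the closing quote.
std::size_t skip_string(std::string const &s, std::size_t i) {
  for (++i; i < s.size() && s[i] != '"'; ++i) {
    if (s[i] == '\\') { ++i; }  // step over the escaped character
  }
  return std::min(i + 1, s.size());
}

// Extract the top-level key/value pairs of one JSON object (assumes well-formed input).
std::vector<kv_pair> extract_pairs(std::string const &s) {
  std::vector<kv_pair> out;
  std::size_t i = s.find('{');
  if (i == std::string::npos) { return out; }
  ++i;
  while (i < s.size()) {
    // Advance to the opening quote of the next key, or stop at the closing brace.
    while (i < s.size() && s[i] != '"' && s[i] != '}') { ++i; }
    if (i >= s.size() || s[i] == '}') { break; }
    std::size_t const key_end = skip_string(s, i);
    std::string const key = s.substr(i + 1, key_end - i - 2);
    i = s.find(':', key_end);
    if (i == std::string::npos) { break; }
    ++i;
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) { ++i; }
    // Scan the value: it ends at a ',' or '}' seen at nesting depth 0.
    std::size_t const begin = i;
    int depth = 0;
    while (i < s.size()) {
      char const c = s[i];
      if (c == '"') { i = skip_string(s, i); continue; }
      if (c == '{' || c == '[') {
        ++depth;
      } else if (c == '}' || c == ']') {
        if (depth == 0) { break; }
        --depth;
      } else if (c == ',' && depth == 0) {
        break;
      }
      ++i;
    }
    std::size_t end = std::min(i, s.size());
    while (end > begin && std::isspace(static_cast<unsigned char>(s[end - 1]))) { --end; }
    std::string value = s.substr(begin, end - begin);
    // Sketch assumption: string values lose their quotes, nested values stay raw.
    if (value.size() >= 2 && value.front() == '"' && value.back() == '"') {
      value = value.substr(1, value.size() - 2);
    }
    out.emplace_back(key, value);
  }
  return out;
}

// Flatten all rows into one pair list plus list offsets; null and empty rows add no pairs.
raw_map_column build_column(std::vector<json_row> const &rows) {
  raw_map_column col;
  col.offsets.push_back(0);
  for (auto const &row : rows) {
    if (row.is_valid) {
      auto const pairs = extract_pairs(row.text);
      col.pairs.insert(col.pairs.end(), pairs.begin(), pairs.end());
    }
    col.offsets.push_back(static_cast<int>(col.pairs.size()));
  }
  return col;
}
```

In this sketch a null row and an empty-string row both produce an empty list: consecutive offsets are equal, so no pairs are attributed to them.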
ttnghia authored Jan 11, 2023
1 parent 656d4f5 commit ab5f52d
Showing 8 changed files with 1,217 additions and 4 deletions.
204 changes: 204 additions & 0 deletions src/main/cpp/.clang-format
@@ -0,0 +1,204 @@
---
# Reference: https://clang.llvm.org/docs/ClangFormatStyleOptions.html
Language: Cpp
# BasedOnStyle: LLVM
# no indentation (-2 from indent, which is 2)
AccessModifierOffset: -2
AlignAfterOpenBracket: Align
# int aaaa = 12;
# int b    = 23;
# int ccc  = 23;
# leaving OFF
AlignConsecutiveAssignments: false
# int         aaaa = 12;
# float       b = 23;
# std::string ccc = 23;
# leaving OFF
AlignConsecutiveDeclarations: false
##define A          \
#  int aaaa;        \
#  int b;           \
#  int dddddddddd;
# leaving ON
AlignEscapedNewlines: Right
# int aaa = bbbbbbbbbbbbbbb +
#           ccccccccccccccc;
# leaving ON
AlignOperands: true
# true:                                   false:
# int a;     // My comment a      vs.     int a; // My comment a
# int b = 2; // comment b                 int b = 2; // comment about b
# leaving ON
AlignTrailingComments: true
# squeezes a long declaration's arguments to the next line:
#true:
#void myFunction(
#    int a, int b, int c, int d, int e);
#
#false:
#void myFunction(int a,
#                int b,
#                int c,
#                int d,
#                int e);
# leaving ON
AllowAllParametersOfDeclarationOnNextLine: true
# changed to ON, as we use short blocks on same lines
AllowShortBlocksOnASingleLine: true
# set this to ON, we use this in a few places
AllowShortCaseLabelsOnASingleLine: true
# set this to ON
AllowShortFunctionsOnASingleLine: Inline
AllowShortIfStatementsOnASingleLine: false
AllowShortLoopsOnASingleLine: false
# Deprecated option.
# PenaltyReturnTypeOnItsOwnLine applies, as we set this to None,
# where it tries to break after the return type automatically
AlwaysBreakAfterDefinitionReturnType: None
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: MultiLine

# if all the arguments for a function don't fit in a single line,
# with a value of "false", it'll split each argument into different lines
BinPackArguments: true
BinPackParameters: true

# if this is set to Custom, the BraceWrapping flags apply
BreakBeforeBraces: Custom
BraceWrapping:
  AfterClass: false
  AfterControlStatement: false
  AfterEnum: false
  AfterFunction: false
  AfterNamespace: false
  AfterObjCDeclaration: false
  AfterStruct: false
  AfterUnion: false
  AfterExternBlock: false
  BeforeCatch: false
  BeforeElse: false
  IndentBraces: false
  SplitEmptyFunction: false
  SplitEmptyRecord: false
  SplitEmptyNamespace: false

# will break after operators when a line is too long
BreakBeforeBinaryOperators: None
# not in docs.. so that's nice
BreakBeforeInheritanceComma: false
# This will break inheritance list and align on colon,
# it also places each inherited class in a different line.
# Leaving ON
BreakInheritanceList: BeforeColon

#
#true:
#veryVeryVeryVeryVeryVeryVeryVeryVeryVeryVeryLongDescription
#    ? firstValue
#    : SecondValueVeryVeryVeryVeryLong;
#
#false:
#veryVeryVeryVeryVeryVeryVeryVeryVeryVeryVeryLongDescription ?
#    firstValue :
#    SecondValueVeryVeryVeryVeryLong;
BreakBeforeTernaryOperators: false

BreakConstructorInitializersBeforeComma: false
BreakConstructorInitializers: BeforeColon
BreakAfterJavaFieldAnnotations: true
BreakStringLiterals: true
# So the line lengths in cudf are not following a limit, at the moment.
# Usually it's a long comment that makes the line length inconsistent.
# Command I used to find max line lengths (from cpp directory):
# find include src tests|grep "\." |xargs -I{} bash -c "awk '{print length}' {} | sort -rn | head -1"|sort -n
# I picked 100, as it seemed somewhere around median
ColumnLimit: 100
# TODO: not aware of any of these at this time
CommentPragmas: '^ IWYU pragma:'
# So it doesn't put subsequent namespaces in the same line
CompactNamespaces: false
ConstructorInitializerAllOnOneLineOrOnePerLine: false
ConstructorInitializerIndentWidth: 4
ContinuationIndentWidth: 4
# TODO: adds spaces around the element list
# in initializer: vector<T> x{ {}, ..., {} }
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
ExperimentalAutoDetectBinPacking: false
# } // namespace a => useful
FixNamespaceComments: true
ForEachMacros:
  - foreach
  - Q_FOREACH
  - BOOST_FOREACH
IncludeBlocks: Regroup
IncludeCategories:
  - Regex: '<[[:alnum:]]+>'
    Priority: 0
  - Regex: '<[[:alnum:].]+>'
    Priority: 1
  - Regex: '<.*>'
    Priority: 2
  - Regex: '.*/.*'
    Priority: 3
  - Regex: '.*'
    Priority: 4
# if a header matches this in an include group, it will be moved up to the
# top of the group.
IncludeIsMainRegex: '(Test)?$'
IndentCaseLabels: true
IndentPPDirectives: None
IndentWidth: 2
IndentWrappedFunctionNames: false
JavaScriptQuotes: Leave
JavaScriptWrapImports: true
KeepEmptyLinesAtTheStartOfBlocks: true
MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBinPackProtocolList: Auto
ObjCBlockIndentWidth: 2
ObjCSpaceAfterProperty: false
ObjCSpaceBeforeProtocolList: true

# Penalties: leaving unchanged for now
# https://stackoverflow.com/questions/26635370/in-clang-format-what-do-the-penalties-do
PenaltyBreakAssignment: 2
PenaltyBreakBeforeFirstCallParameter: 19
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
PenaltyBreakString: 1000
PenaltyBreakTemplateDeclaration: 10
PenaltyExcessCharacter: 1000000
# As currently set, we don't see return types being
# left on their own line, leaving at 60
PenaltyReturnTypeOnItsOwnLine: 60

# char* foo vs char *foo, picking Right aligned
PointerAlignment: Right
ReflowComments: true
# leaving ON, but this could be something to turn OFF
SortIncludes: true
SortUsingDeclarations: true
SpaceAfterCStyleCast: false
SpaceAfterTemplateKeyword: true
SpaceBeforeAssignmentOperators: true
SpaceBeforeCpp11BracedList: false
SpaceBeforeCtorInitializerColon: true
SpaceBeforeInheritanceColon: true
SpaceBeforeParens: ControlStatements
SpaceBeforeRangeBasedForLoopColon: true
SpaceInEmptyParentheses: false
SpacesBeforeTrailingComments: 1
SpacesInAngles: false
SpacesInContainerLiterals: true
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp11
TabWidth: 8
UseTab: Never
...
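The `IncludeCategories` list in the config above assigns each `#include` the priority of the first regex that matches its target, and `IncludeBlocks: Regroup` then groups includes by that priority. A rough sketch of that lookup follows. The helper `include_priority` is hypothetical and not part of clang-format, and it uses `std::regex` with an unanchored search as an approximation of clang-format's own regex engine.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Illustrative helper (not part of clang-format): returns the priority of the first
// category whose regex matches the include target, using the list from the config.
int include_priority(std::string const &target) {
  static std::vector<std::pair<std::regex, int>> const categories = {
      {std::regex("<[[:alnum:]]+>"), 0},   // e.g. <vector>
      {std::regex("<[[:alnum:].]+>"), 1},  // e.g. <stdio.h>
      {std::regex("<.*>"), 2},             // any other angle-bracket include
      {std::regex(".*/.*"), 3},            // quoted include with a directory part
      {std::regex(".*"), 4},               // everything else
  };
  for (auto const &category : categories) {
    // Assumption: unanchored search, first match wins.
    if (std::regex_search(target, category.first)) { return category.second; }
  }
  return 4;  // unreachable in practice because ".*" always matches
}
```

Under these assumptions, `<cudf_jni_apis.hpp>` lands in category 2 (the underscore is not in `[[:alnum:].]`), while `"map_utils.hpp"` falls through to category 4.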
10 changes: 6 additions & 4 deletions src/main/cpp/CMakeLists.txt
@@ -140,15 +140,17 @@ set(CUDFJNI_INCLUDE_DIRS

 add_library(
   spark_rapids_jni SHARED
-  src/RowConversionJni.cpp
   src/CastStringJni.cpp
+  src/DecimalUtilsJni.cpp
+  src/MapUtilsJni.cpp
   src/NativeParquetJni.cpp
+  src/RowConversionJni.cpp
+  src/ZOrderJni.cpp
   src/cast_string.cu
   src/cast_string_to_float.cu
-  src/row_conversion.cu
-  src/DecimalUtilsJni.cpp
   src/decimal_utils.cu
-  src/ZOrderJni.cpp
+  src/map_utils.cu
+  src/row_conversion.cu
   src/zorder.cu
 )

35 changes: 35 additions & 0 deletions src/main/cpp/src/MapUtilsJni.cpp
@@ -0,0 +1,35 @@
/*
 * Copyright (c) 2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#include <cudf_jni_apis.hpp>
#include <dtype_utils.hpp>

#include "map_utils.hpp"

extern "C" {

JNIEXPORT jlong JNICALL Java_com_nvidia_spark_rapids_jni_MapUtils_extractRawMapFromJsonString(
    JNIEnv *env, jclass, jlong input_handle) {
  JNI_NULL_CHECK(env, input_handle, "json_column_handle is null", 0);

  try {
    cudf::jni::auto_set_device(env);
    auto const input = reinterpret_cast<cudf::column_view const *>(input_handle);
    return cudf::jni::ptr_as_jlong(spark_rapids_jni::from_json(*input).release());
  }
  CATCH_STD(env, 0);
}
}
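The exported function name above follows the JNI naming convention: `Java_`, then the fully qualified class name with `.` replaced by `_`, then `_` and the method name. It therefore binds to a native method `extractRawMapFromJsonString` on the class `com.nvidia.spark.rapids.jni.MapUtils`, and the `jclass` second parameter indicates a static method. The sketch below rebuilds that symbol name. The helper `jni_symbol_name` is hypothetical and handles only the `.`/`/` and `_` escapes, not Unicode characters or overloaded-method signatures.

```cpp
#include <string>

// Hypothetical helper: builds the symbol name JNI looks up for a native method.
// Handles only "." or "/" -> "_" and "_" -> "_1"; Unicode escapes and the
// signature suffix used for overloaded methods are deliberately left out.
std::string jni_symbol_name(std::string const &class_name, std::string const &method) {
  std::string out = "Java_";
  auto append_mangled = [&out](std::string const &part) {
    for (char const c : part) {
      if (c == '.' || c == '/') {
        out += '_';
      } else if (c == '_') {
        out += "_1";
      } else {
        out += c;
      }
    }
  };
  append_mangled(class_name);
  out += '_';
  append_mangled(method);
  return out;
}
```

Calling `jni_symbol_name("com.nvidia.spark.rapids.jni.MapUtils", "extractRawMapFromJsonString")` reproduces the function name declared in `MapUtilsJni.cpp`.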